Update README.md
README.md
|

### Pretraining

The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 300k steps (a bit over 2 epochs with a batch size of 256). The optimizer used was a second-order optimization method called [Distributed Shampoo](https://github.com/google-research/google-research/tree/master/scalable_shampoo), with a learning rate of 1e-4, learning rate warmup for 4,000 steps, and cosine decay of the learning rate afterwards.
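
For illustration, the learning rate schedule described above (peak 1e-4, 4,000 warmup steps, cosine decay over the 300k training steps) can be written as a short Optax-style sketch. This is not the actual training code: the initial and final learning rate values are assumptions, and the Distributed Shampoo optimizer itself comes from the repository linked above.

```python
import optax

# Rough sketch of the described schedule; init_value and end_value are assumptions.
TOTAL_STEPS = 300_000

lr_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,           # assumed starting learning rate for the warmup
    peak_value=1e-4,          # peak learning rate stated in the model card
    warmup_steps=4_000,       # warmup length stated in the model card
    decay_steps=TOTAL_STEPS,  # total schedule length (includes the warmup)
    end_value=0.0,            # assumed final learning rate after the cosine decay
)

# The resulting schedule is a callable mapping step -> learning rate and can be
# passed to an Optax-style optimizer in place of a constant learning rate.
print(lr_schedule(0), lr_schedule(4_000), lr_schedule(150_000))
```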

At first, the commonly used Adam optimizer was tried, but there were significant issues getting the model to converge even after trying multiple different learning rates, so Adam was replaced with Distributed Shampoo, which worked much better.

## Evaluation results

Evaluation was done using the *validation* split of the [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned) dataset with [Perplexity](https://huggingface.co/course/chapter7/3#perplexity-for-language-models) (the smaller the score, the better) as the evaluation metric. As seen from the table below, this model (the first row of the table) loses to our bigger model variants.

| Model | Perplexity |
|------------------------------------------|------------|
|Finnish-NLP/gpt2-finnish |44.19 |
|Finnish-NLP/gpt2-medium-finnish |34.08 |
|Finnish-NLP/gpt2-large-finnish |**30.74** |
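
As a concrete illustration of the metric, the sketch below shows one way to compute perplexity for this model on the validation split with the Transformers and Datasets libraries. It is a minimal sketch, not the authors' evaluation script: the subset size, sequence handling, and the dataset's `text` column name are assumptions.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the "text" column name, the 100-example subset and the
# 512-token truncation are assumptions; the table numbers come from the
# authors' own evaluation run.
model_name = "Finnish-NLP/gpt2-finnish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

dataset = load_dataset("Finnish-NLP/mc4_fi_cleaned", split="validation")

total_nll, total_tokens = 0.0, 0
for example in dataset.select(range(100)):  # small subset to keep the example fast
    enc = tokenizer(example["text"], return_tensors="pt",
                    truncation=True, max_length=512)
    input_ids = enc["input_ids"]
    if input_ids.size(1) < 2:
        continue
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy
        # over the predicted (shifted) tokens.
        loss = model(input_ids, labels=input_ids).loss
    total_nll += loss.item() * (input_ids.size(1) - 1)
    total_tokens += input_ids.size(1) - 1

print(f"Perplexity: {math.exp(total_nll / total_tokens):.2f}")
```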

## Team Members