Update README.md
README.md CHANGED
@@ -60,7 +60,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -283,7 +283,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following this, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 
 
@@ -430,8 +430,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.
-and 1 final
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher-quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
@@ -691,6 +691,7 @@ The dataset does not allow for external contributions.
 
 ---
 
+
 ## Evaluation
 
 ### Gold-standard benchmarks
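The reweighting described in the second hunk (code and English halved, Spain's co-official languages doubled, everything else left at its original proportion) amounts to applying a per-source multiplier to each source's token count before the epoch is assembled. A minimal sketch of that bookkeeping follows; the per-source token counts are hypothetical placeholders for illustration, not the model card's actual figures, which are given in the distribution shown in the README itself.

```python
# Hypothetical sketch of the per-source reweighting described in the diff:
# code and English are downsampled to half, Spain's co-official languages
# are oversampled by 2x, and all other sources keep their original weight.
# Token counts (in billions) are invented for illustration only.

raw_tokens = {"en": 900, "code": 500, "es": 250, "ca": 50, "gl": 15, "eu": 10, "other": 600}

multipliers = {"en": 0.5, "code": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}  # default is 1.0

adjusted = {source: count * multipliers.get(source, 1.0) for source, count in raw_tokens.items()}
epoch_total = sum(adjusted.values())

for source, count in adjusted.items():
    print(f"{source:>5}: {count:7.1f}B tokens  ({100 * count / epoch_total:4.1f}% of the epoch)")
print(f"epoch total: {epoch_total:.1f}B tokens")
```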
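As a sanity check on the totals introduced in the third hunk, the per-epoch figures quoted in the README (2.4T for the first three epochs, 2.68T for the two FineWebEdu epochs, and 0.315T for the final epoch) do add up to the 12.875T headline number. A quick back-of-the-envelope verification, assuming "T" means trillions of tokens throughout:

```python
# Back-of-the-envelope check of the pre-training token budget quoted in the updated README.
# All figures are in trillions of tokens and are taken directly from the diff above.

first_phase = 3 * 2.4    # three epochs on the original 2.4T-token mixture
second_phase = 2 * 2.68  # two epochs with Colossal OSCAR's English part swapped for FineWebEdu
final_phase = 1 * 0.315  # one final epoch of higher-quality tokens

total = first_phase + second_phase + final_phase
print(f"total tokens seen: {total:.3f}T")  # 12.875T, matching the figure in the Description
```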