jsaizant committed
Commit 6d3a269 · verified · Parent(s): 4ea3632

Update README.md

Files changed (1):
  1. README.md +5 -4
README.md CHANGED

@@ -60,7 +60,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on 11.675 trillion tokens of highly curated data.
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -283,7 +283,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.08 trillion tokens, distributed as outlined below:
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
 
@@ -430,8 +430,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.08T tokens per epoch;
-and 1 final round of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 11.675 trillion tokens.
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
@@ -691,6 +691,7 @@ The dataset does not allow for external contributions.
 
 ---
 
+
 ## Evaluation
 
 ### Gold-standard benchmarks
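
The revised token totals can be cross-checked with simple arithmetic. Below is a minimal sketch, assuming only the per-epoch figures quoted in the hunks above (3 epochs of 2.4T tokens, 2 FineWebEdu epochs of 2.68T rather than 2.08T, and one final 0.315T epoch); the variable names are illustrative and not part of the repository:

```python
# Back-of-the-envelope check of the token totals edited in this commit.
# All figures come from the README diff above; nothing here is measured.

initial_epochs = 3 * 2.4    # three pre-training epochs, 2.4T tokens each
fineweb_epochs = 2 * 2.68   # two FineWebEdu epochs, 2.68T each (previously stated as 2.08T)
final_epoch    = 0.315      # one final epoch of higher-quality tokens

old_total = 3 * 2.4 + 2 * 2.08 + 0.315   # 11.675T, the figure being replaced
new_total = initial_epochs + fineweb_epochs + final_epoch

print(f"old: {old_total:.3f}T, new: {new_total:.3f}T")  # old: 11.675T, new: 12.875T
```

This matches the 11.675T → 12.875T update applied in both the Description and the Datasheet passages of the diff.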