loubnabnl HF staff commited on
Commit
bb71039
·
verified ·
1 Parent(s): b8c1842

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -21,7 +21,7 @@ This is a 1.8B model trained on [Cosmopedia](https://huggingface.co/datasets/Hug
21
 
22
  # Training dataset
23
  The training corpus consisted of 30B tokens, 25B of which are synthetic from Cosmopedia. Since we didn't explore the synthetic generation of code, we augmented the dataset with 5B tokens of non-synthetic sources like the `code-python-0.60-to-1.00` and `web-0.50-to-1.00` subsets of [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText). We also added 1M files from [The Stack](https://huggingface.co/datasets/bigcode/the-stack)'s Jupyter Notebooks, converted to script. They tend to have educational code interleaved with text.
24
- We also included [ultrachat](https://huggingface.co/datasets/stingning/ultrachat) formatted in the chat format of `LlaMa` models, so we don't have to instruction-tune the model after the pre-training. Additionally, we upsampled twice the data from these seed sources twice to help with commonsense and reasoning: stories, AutoMathText & KhanAcademy.
25
 
26
  We trained for 6 epochs, resulting in a model trained on 180B tokens with a sequence length of 2k, a global batch size of 1.3M tokens and a learning rate of 3e-4 with a cosine schedule for 140k steps.
27
  We used the tokenizer from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1/).
 
21
 
22
  # Training dataset
23
  The training corpus consisted of 30B tokens, 25B of which are synthetic from Cosmopedia. Since we didn't explore the synthetic generation of code, we augmented the dataset with 5B tokens of non-synthetic sources like the `code-python-0.60-to-1.00` and `web-0.50-to-1.00` subsets of [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText). We also added 1M files from [The Stack](https://huggingface.co/datasets/bigcode/the-stack)'s Jupyter Notebooks, converted to script. They tend to have educational code interleaved with text.
24
+ We also included [ultrachat](https://huggingface.co/datasets/stingning/ultrachat) formatted in the chat format of `LlaMa` models, so we don't have to instruction-tune the model after the pre-training. Additionally, we upsampled twice the data from these seed sources to help with commonsense and reasoning: stories, AutoMathText & KhanAcademy.
25
 
26
  We trained for 6 epochs, resulting in a model trained on 180B tokens with a sequence length of 2k, a global batch size of 1.3M tokens and a learning rate of 3e-4 with a cosine schedule for 140k steps.
27
  We used the tokenizer from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1/).