Update README.md
README.md CHANGED
@@ -48,9 +48,14 @@ The model was trained using the following setup:
 - **Total Training Tokens:** 2.6T
 - **Hardware:** Trained on H100 GPUs
 
-
-
-
+
+We train our 1.4B model for 2.6T tokens on DCLM-Baseline.
+Similar to the 7B model training recipe described in Appendix P of our paper,
+we train for 2.3T tokens on DCLM-baseline combined with the StarCoder and ProofPile2 datasets,
+with the hyper-parameters described above.
+Note that we use a schedule set for the full dataset, and stop training early at 2.3T tokens.
+Then, we cool down the model on the same dataset to the cooldown LR over 200B tokens.
+We will update our paper soon with more training details.
 
 ## Evaluation
 
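For readers skimming the diff, the schedule described in the added lines (a schedule configured for the full 2.6T-token budget, stopped early at 2.3T tokens, then cooled down to the cooldown LR over 200B tokens) can be sketched as below. This is only an illustrative sketch: the cosine shape, the warmup length, and the `PEAK_LR` / `COOLDOWN_LR` values are assumptions for illustration, not values taken from the model card or the paper.

```python
import math

# Hypothetical numbers: only the 2.6T budget, the 2.3T early stop, and the
# 200B-token cooldown are stated in the README; everything else is a placeholder.
FULL_BUDGET_TOKENS = 2.6e12   # schedule is set for the full dataset
EARLY_STOP_TOKENS  = 2.3e12   # main phase stops here
COOLDOWN_TOKENS    = 0.2e12   # then cool down over 200B tokens
WARMUP_TOKENS      = 1e10     # assumed warmup length
PEAK_LR            = 3e-3     # assumed peak learning rate
COOLDOWN_LR        = 3e-5     # assumed final (cooldown) learning rate


def main_phase_lr(tokens_seen: float) -> float:
    """Warmup plus cosine decay, configured for the FULL 2.6T budget,
    even though the main phase stops early at 2.3T."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (FULL_BUDGET_TOKENS - WARMUP_TOKENS)
    return COOLDOWN_LR + 0.5 * (PEAK_LR - COOLDOWN_LR) * (1 + math.cos(math.pi * progress))


def cooldown_lr(tokens_into_cooldown: float) -> float:
    """Linear decay from the LR reached at the early-stop point down to
    COOLDOWN_LR over the 200B-token cooldown phase."""
    start_lr = main_phase_lr(EARLY_STOP_TOKENS)
    frac = min(tokens_into_cooldown / COOLDOWN_TOKENS, 1.0)
    return start_lr + frac * (COOLDOWN_LR - start_lr)


def lr_at(total_tokens_seen: float) -> float:
    """Learning rate as a function of total tokens seen across both phases."""
    if total_tokens_seen <= EARLY_STOP_TOKENS:
        return main_phase_lr(total_tokens_seen)
    return cooldown_lr(total_tokens_seen - EARLY_STOP_TOKENS)


if __name__ == "__main__":
    for t in (0.5e12, 1.5e12, 2.3e12, 2.4e12, 2.5e12):
        print(f"{t/1e12:.1f}T tokens -> lr {lr_at(t):.2e}")
```

The point of the sketch is the interaction the added lines describe: because the decay is set for the full dataset but training stops early, the learning rate is still relatively high at 2.3T tokens, which is presumably why a separate cooldown phase to a low LR over 200B tokens follows before the checkpoint is released.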