YikangS committed
Commit f4c989b
1 Parent(s): 2e7aad4

update readme

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -64,11 +64,11 @@ JetMoE-8x1B is trained on 1.25T tokens from publicly available datasets, with a
   **Output** Models generate text only.
 
   ## Training Details
-  Our training recipe follows the [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-stage training method. The first stage use a constant learning rate with linear warmup. It is trained 1 trillion tokens from large scale opensource pretraining datasets, including RefinedWeb, Pile, Starcoder Github data, etc. The second stage use an annealing phase with exponential learning rate data and is trained on 250 billion tokens from phase one datasets and extra high-quality opensource datasets.
+  Our training recipe follows [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc. Phase 2 uses exponential learning rate decay (annealing) and is trained on 250 billion tokens from the Phase 1 datasets and extra high-quality open-source datasets.
   <figure>
   <center>
-  <img src="images/Phase1_data.png" width="40.3%">
-  <img src="images/Phase2_data.png" width="50%">
+  <img src="images/Phase1_data.png" width="60%">
+  <img src="images/Phase2_data.png" width="75%">
   </center>
   </figure>
 
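The updated paragraph describes a schedule with linear warmup to a constant learning rate held through Phase 1, followed by an exponential decay (annealing) through Phase 2 while extra high-quality data is mixed in. The sketch below only illustrates that shape; the step counts, peak learning rate, and decay floor are placeholder assumptions, not JetMoE-8x1B's actual hyperparameters.

```python
# Illustrative two-phase LR schedule: linear warmup -> constant (Phase 1) -> exponential anneal (Phase 2).
# All values below are placeholders for illustration, not the hyperparameters used to train JetMoE-8x1B.
import math

WARMUP_STEPS = 2_000      # linear warmup length (assumed)
PHASE1_STEPS = 250_000    # steps covering the ~1T-token constant-LR phase (assumed)
PHASE2_STEPS = 62_500     # steps covering the ~250B-token annealing phase (assumed)
PEAK_LR = 5e-4            # peak learning rate (assumed)
MIN_LR = 5e-5             # floor the decay approaches (assumed)

def learning_rate(step: int) -> float:
    """Return the learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Phase 1, warmup: ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    if step < PHASE1_STEPS:
        # Phase 1, stable: hold the learning rate constant.
        return PEAK_LR
    # Phase 2: decay exponentially from the peak toward MIN_LR over the annealing steps.
    progress = min(1.0, (step - PHASE1_STEPS) / PHASE2_STEPS)
    return MIN_LR + (PEAK_LR - MIN_LR) * math.exp(-5.0 * progress)

if __name__ == "__main__":
    for s in (0, 1_000, 100_000, 260_000, 312_500):
        print(f"step {s:>7}: lr = {learning_rate(s):.2e}")
```

The decay constant (5.0) is likewise arbitrary; the only point is the switch from a flat schedule in Phase 1 to an exponentially decaying one in Phase 2 as the data mixture changes.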