YikangS committed
Commit f4c989b
1 Parent(s): 2e7aad4

update readme

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -64,11 +64,11 @@ JetMoE-8x1B is trained on 1.25T tokens from publicly available datasets, with a
   **Output** Models generate text only.
 
   ## Training Details
-  Our training recipe follows the [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-stage training method. The first stage use a constant learning rate with linear warmup. It is trained 1 trillion tokens from large scale opensource pretraining datasets, including RefinedWeb, Pile, Starcoder Github data, etc. The second stage use an annealing phase with exponential learning rate data and is trained on 250 billion tokens from phase one datasets and extra high-quality opensource datasets.
+  Our training recipe follows [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc. Phase 2 uses exponential learning rate decay (annealing) and is trained on 250 billion tokens from the Phase 1 datasets and extra high-quality open-source datasets.
   <figure>
   <center>
-  <img src="images/Phase1_data.png" width="40.3%">
-  <img src="images/Phase2_data.png" width="50%">
+  <img src="images/Phase1_data.png" width="60%">
+  <img src="images/Phase2_data.png" width="75%">
   </center>
   </figure>
 
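The updated paragraph describes a schedule with linear warmup to a constant learning rate held through Phase 1, followed by an exponential decay (annealing) through Phase 2 while extra high-quality data is mixed in. The sketch below only illustrates that shape; the step counts, peak learning rate, and decay floor are placeholder assumptions, not JetMoE-8x1B's actual hyperparameters.

```python
# Illustrative two-phase LR schedule: linear warmup -> constant (Phase 1) -> exponential anneal (Phase 2).
# All values below are placeholders for illustration, not the hyperparameters used to train JetMoE-8x1B.
import math

WARMUP_STEPS = 2_000      # linear warmup length (assumed)
PHASE1_STEPS = 250_000    # steps covering the ~1T-token constant-LR phase (assumed)
PHASE2_STEPS = 62_500     # steps covering the ~250B-token annealing phase (assumed)
PEAK_LR = 5e-4            # peak learning rate (assumed)
MIN_LR = 5e-5             # floor the decay approaches (assumed)

def learning_rate(step: int) -> float:
    """Return the learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Phase 1, warmup: ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    if step < PHASE1_STEPS:
        # Phase 1, stable: hold the learning rate constant.
        return PEAK_LR
    # Phase 2: decay exponentially from the peak toward MIN_LR over the annealing steps.
    progress = min(1.0, (step - PHASE1_STEPS) / PHASE2_STEPS)
    return MIN_LR + (PEAK_LR - MIN_LR) * math.exp(-5.0 * progress)

if __name__ == "__main__":
    for s in (0, 1_000, 100_000, 260_000, 312_500):
        print(f"step {s:>7}: lr = {learning_rate(s):.2e}")
```

The decay constant (5.0) is likewise arbitrary; the only point is the switch from a flat schedule in Phase 1 to an exponentially decaying one in Phase 2 as the data mixture changes.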