YikangS committed
Commit 3930995
1 Parent(s): e289db5

update readme

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -14,13 +14,6 @@ Given the current market price of H100 GPU hours, training the model only costs
  To our surprise, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B despite the lower training cost and computation.
  Compared to a model with similar training and inference computation, JetMoE-8B achieves significantly better performance than Gemma-2B.
 
- <figure>
- <center>
- <img src="images/jetmoe_architecture.png" width="40%">
- <figcaption>JetMoE Architecture</figcaption>
- </center>
- </figure>
-
  ## Evaluation Results
  |Model|Active Params|Training Tokens|ARC-challenge|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|Open LLM Leaderboard Average|MBPP|HumanEval|
  |---|---|---|---|---|---|---|---|---|---|---|---|
@@ -57,6 +50,13 @@ Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input
  It has 8 billion parameters in total and 2.2B active parameters.
  JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a learning rate of 5.0 x 10<sup>-4</sup> and a global batch size of 4M tokens.
 
+ <figure>
+ <center>
+ <img src="images/jetmoe_architecture.png" width="40%">
+ <figcaption>JetMoE Architecture</figcaption>
+ </center>
+ </figure>
+
  **Input** Models input text only.
 
  **Output** Models generate text only.
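
For context on the architecture the relocated figure illustrates: the second hunk's surrounding text describes MoA and MoE layers with 8 experts each, of which 2 are activated per input. Below is a minimal PyTorch sketch of that top-2 routing pattern; the class name, hidden sizes, and expert MLP shape are illustrative assumptions, not JetMoE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Illustrative sparse MoE layer: num_experts experts, top_k active per token."""

    def __init__(self, d_model=1024, d_hidden=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network (hypothetical shape).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                                # (tokens, experts)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)  # keep the 2 best experts
        weights = F.softmax(weights, dim=-1)                        # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 1024)   # 16 tokens with d_model = 1024
print(Top2MoE()(tokens).shape)   # torch.Size([16, 1024])
```

Only the two selected experts run for each token, which is how the model keeps 2.2B active parameters out of 8B total. Under the stated recipe (1.25T training tokens with a 4M-token global batch), training corresponds to roughly 312,500 optimizer steps.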