YikangS committed
Commit 3930995
1 Parent(s): e289db5

update readme

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -14,13 +14,6 @@ Given the current market price of H100 GPU hours, training the model only costs
  To our surprise, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B despite the lower training cost and computation.
  Compared to a model with similar training and inference computation, JetMoE-8B achieves significantly better performance than Gemma-2B.
 
- <figure>
- <center>
- <img src="images/jetmoe_architecture.png" width="40%">
- <figcaption>JetMoE Architecture</figcaption>
- </center>
- </figure>
-
  ## Evaluation Results
  |Model|Active Params|Training Tokens|ARC-challenge|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|Open LLM Leaderboard Average|MBPP|HumanEval|
  |---|---|---|---|---|---|---|---|---|---|---|---|
@@ -57,6 +50,13 @@ Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input
  It has 8 billion parameters in total and 2.2B active parameters.
  JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a learning rate of 5.0 x 10<sup>-4</sup> and a global batch size of 4M tokens.
 
+ <figure>
+ <center>
+ <img src="images/jetmoe_architecture.png" width="40%">
+ <figcaption>JetMoE Architecture</figcaption>
+ </center>
+ </figure>
+
  **Input** Models input text only.
 
  **Output** Models generate text only.
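
For context on the architecture the relocated figure illustrates: the second hunk's surrounding text describes MoA and MoE layers with 8 experts each, of which 2 are activated per input. Below is a minimal PyTorch sketch of that top-2 routing pattern; the class name, hidden sizes, and expert MLP shape are illustrative assumptions, not JetMoE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Illustrative sparse MoE layer: num_experts experts, top_k active per token."""

    def __init__(self, d_model=1024, d_hidden=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network (hypothetical shape).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                                # (tokens, experts)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)  # keep the 2 best experts
        weights = F.softmax(weights, dim=-1)                        # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 1024)   # 16 tokens with d_model = 1024
print(Top2MoE()(tokens).shape)   # torch.Size([16, 1024])
```

Only the two selected experts run for each token, which is how the model keeps 2.2B active parameters out of 8B total. Under the stated recipe (1.25T training tokens with a 4M-token global batch), training corresponds to roughly 312,500 optimizer steps.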