YikangS committed
Commit
5120a94
1 Parent(s): 1f5823d

update readme

Files changed (1): README.md (+4 -4)
README.md CHANGED
@@ -3,16 +3,16 @@ license: apache-2.0
 ---
 # **JetMoE**
 **JetMoE-8B** is an 8B Mixture-of-Experts (MoE) language model developed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ) and [MyShell](https://myshell.ai/).
-The goal of JetMoE is to provide a LLaMA2-level performance and efficient language model with a very limited budget.
+The JetMoE project aims to provide an efficient language model with LLaMA2-level performance on a limited budget.
 To achieve this goal, JetMoE uses a sparsely activated architecture inspired by the [ModuleFormer](https://arxiv.org/abs/2306.04640).
 Each JetMoE block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts.
 Given the input tokens, it activates a subset of its experts to process them.
 Thus, JetMoE-8B has 8B parameters in total, but only 2B are activated for each input token.
-This sparse activation schema enables JetMoE achieve much better training throughput compared to similar size dense models.
+This sparse activation schema enables JetMoE to achieve much better training throughput than similarly sized dense models.
 The model is trained with 1.25T tokens from publicly available datasets on 96 H100s within 13 days.
-Given the current market price of H100 GPU hours, training the model only costs around 0.1 million dollars.
+Given the current market price of H100 GPU hours, training the model costs around 0.1 million dollars.
 To our surprise, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B despite the lower training cost and computation.
-Compared to a model with similar training and inference computation, JetMoE-8B achieves significantly better performance compared to Gemma-2B.
+Compared to a model with similar training and inference computation, such as Gemma-2B, JetMoE-8B achieves significantly better performance.
 
 ## Evaluation Results
 For most benchmarks, we use the same evaluation methodology as in the Open LLM Leaderboard. For code benchmarks, we use the same evaluation methodology as in the LLaMA2 and Deepseek MoE papers. The evaluation results are as follows:
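
Note on the sparse activation described above: the routing idea can be made concrete with a generic top-k mixture-of-MLP-experts layer. The sketch below is illustrative only, not the actual JetMoE code; the class name `SparseMoE`, the expert count, and the dimensions are assumptions chosen for the example (JetMoE additionally routes attention heads in the same way, following ModuleFormer).

```python
# Illustrative sketch of top-k expert routing (a generic mixture of MLP experts),
# NOT the actual JetMoE implementation; names and sizes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Route each token to top_k of num_experts MLP experts; the rest stay idle."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # one routing score per expert
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, dim)
        scores = self.router(x)                           # (num_tokens, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)           # normalize over chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                        # which expert fills this slot per token
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out


tokens = torch.randn(4, 64)                # 4 tokens with hidden size 64 (toy numbers)
layer = SparseMoE(dim=64, hidden_dim=256)
print(layer(tokens).shape)                 # torch.Size([4, 64]); only 2 of 8 experts ran per token
```

Because only `top_k` of the experts run for each token, per-token compute scales with the active parameters (about 2B for JetMoE-8B) rather than with the full 8B.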