---
license: apache-2.0
---

# JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars

<div align="center">
<div> </div>
<img src="https://cdn-uploads.huggingface.co/production/uploads/641de0213239b631552713e4/ieHnwuczidNNoGRA_FN2y.png" width="500"/>
<img src="https://cdn-uploads.huggingface.co/production/uploads/641de0213239b631552713e4/UOsk9_zcbHpCCy6kmryYM.png" width="530"/>
</div>

## Key Messages

1. JetMoE-8B is **trained with less than $0.1 million**<sup>1</sup> **in cost but outperforms LLaMA2-7B from Meta AI**, which has multi-billion-dollar training resources. LLM training can be **much cheaper than people generally thought**.

2. JetMoE-8B is **fully open-sourced and academia-friendly** because:
   - It **only uses public datasets** for training, and the code is open-sourced. No proprietary resources are needed.
   - It **can be finetuned with a very limited compute budget** (e.g., a consumer-grade GPU) that most labs can afford.

3. JetMoE-8B **has only 2.2B active parameters** during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, such as Gemma-2B, JetMoE-8B achieves consistently better performance.

<sup>1</sup> We used a 96×H100 GPU cluster for 2 weeks, which cost ~$0.08 million.

Website: [https://research.myshell.ai/jetmoe](https://research.myshell.ai/jetmoe)

HuggingFace: [https://huggingface.co/jetmoe/jetmoe-8b](https://huggingface.co/jetmoe/jetmoe-8b)

Online Demo on Lepton AI: [https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat](https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat)

## Authors

The project is contributed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ), [Zhen Guo](https://zguo0525.github.io/), [Tianle Cai](https://www.tianle.website/#/) and [Zengyi Qin](https://www.qinzy.tech/). For technical inquiries, please contact [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ). For media and collaboration inquiries, please contact [Zengyi Qin](https://www.qinzy.tech/).

## Collaboration

**If you have great ideas but need more resources (GPU, data, funding, etc.)**, you are welcome to contact **MyShell.ai** via [Zengyi Qin](https://www.qinzy.tech/). **MyShell.ai** is open to collaboration and actively supports high-quality open-source projects.

## Benchmarks

We use the same evaluation methodology as in the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as in the LLaMA2 and Deepseek-MoE papers. The results are shown below:

|Model|Active Params|Training Tokens|Open LLM Leaderboard Avg|ARC|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|MBPP|HumanEval|
|---|---|---|---|---|---|---|---|---|---|---|---|
|Shot||||25|10|5|0|5|5|3|0|
|Metric||||acc_norm|acc_norm|acc|mc2|acc|acc|Pass@1|Pass@1|
|LLaMA2-7B|7B|2T|51.0|53.1|78.6|46.9|38.8|74.0|14.5|20.8|12.8|
|LLaMA-13B|13B|1T|51.4|**56.2**|**80.9**|47.7|39.5|**76.2**|7.6|22.0|15.8|
|DeepseekMoE-16B|2.8B|2T|51.1|53.2|79.8|46.3|36.1|73.7|17.3|34.0|**25.0**|
|Gemma-2B|2B|2T|46.4|48.4|71.8|41.8|33.1|66.3|16.9|28.0|24.4|
|JetMoE-8B|2.2B|1.25T|**53.0**|48.7|80.5|**49.2**|**41.7**|70.2|**27.8**|**34.2**|14.6|

MT-Bench scores of JetMoE-8B-chat and other chat models (higher is better):

| Model | MT-Bench Score |
|---------------------|-----------|
| GPT-4 | 9.014 |
| GPT-3.5-turbo | 7.995 |
| Claude-v1 | 7.923 |
| **JetMoE-8B-chat** | **6.681** |
| Llama-2-13b-chat | 6.650 |
| Vicuna-13b-v1.3 | 6.413 |
| Wizardlm-13b | 6.353 |
| Llama-2-7b-chat | 6.269 |

To our surprise, despite its lower training cost and computation, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B. Compared to a model with similar training and inference computation, such as Gemma-2B, JetMoE-8B achieves better performance.
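For reference, the "Open LLM Leaderboard Avg" column above is the unweighted mean of the six leaderboard tasks (ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k). A minimal sketch that reproduces a few rows of the table from the per-task scores:

```python
# Reproduce the "Open LLM Leaderboard Avg" column as the unweighted mean of the
# six leaderboard tasks (ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k).
# Scores are copied from the benchmark table above.
scores = {
    "LLaMA2-7B": [53.1, 78.6, 46.9, 38.8, 74.0, 14.5],
    "Gemma-2B":  [48.4, 71.8, 41.8, 33.1, 66.3, 16.9],
    "JetMoE-8B": [48.7, 80.5, 49.2, 41.7, 70.2, 27.8],
}

for model, task_scores in scores.items():
    avg = sum(task_scores) / len(task_scores)
    print(f"{model}: {avg:.1f}")  # 51.0, 46.4, 53.0 -- matches the Avg column
```
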
## Model Usage

To load the models, you need to install this package:
```
pip install -e .
```

Then you can load the model with the following code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AutoModelForSequenceClassification
from jetmoe import JetMoEForCausalLM, JetMoEConfig, JetMoEForSequenceClassification

AutoConfig.register("jetmoe", JetMoEConfig)
AutoModelForCausalLM.register(JetMoEConfig, JetMoEForCausalLM)
AutoModelForSequenceClassification.register(JetMoEConfig, JetMoEForSequenceClassification)
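# Once the classes are registered, the checkpoint can be loaded through the Auto
# classes. The lines below are a minimal, hedged sketch of the remaining steps:
# the checkpoint name is taken from the HuggingFace link above, everything else
# is an illustrative assumption rather than the exact original snippet.
tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = AutoModelForCausalLM.from_pretrained("jetmoe/jetmoe-8b")

# Quick generation check.
inputs = tokenizer("JetMoE is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
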
## Training Details

Our training recipe follows [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, GitHub data, etc. Phase 2 uses exponential learning rate decay and is trained on 250 billion tokens from the phase 1 datasets and extra high-quality open-source datasets.
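In code, this corresponds to a warmup / constant / exponential-decay learning-rate schedule. A minimal sketch is below; the peak learning rate, warmup length, and decay constants are illustrative assumptions, not the actual JetMoE hyperparameters:

```python
def two_phase_lr(step, phase1_steps, warmup_steps, peak_lr=3e-4,
                 decay_rate=0.5, decay_every=10_000):
    """Illustrative two-phase schedule: linear warmup to a constant LR (Phase 1),
    then exponential decay (Phase 2). All hyperparameters here are assumptions."""
    if step < warmup_steps:                  # Phase 1: linear warmup
        return peak_lr * step / warmup_steps
    if step < phase1_steps:                  # Phase 1: constant learning rate
        return peak_lr
    phase2_step = step - phase1_steps        # Phase 2: exponential decay
    return peak_lr * (decay_rate ** (phase2_step / decay_every))

# Example: learning rate at a few points of a run with 100k Phase 1 steps.
for s in (0, 1_000, 50_000, 100_000, 110_000):
    print(s, f"{two_phase_lr(s, phase1_steps=100_000, warmup_steps=2_000):.2e}")
```
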
## Technical Report

For more details, please refer to the JetMoE Technical Report (Coming Soon).
## JetMoE Model Index

|Model|Index|
|---|---|
|JetMoE-8B| [Link](https://huggingface.co/jetmoe/jetmoe-8B) |

## Acknowledgement

We express our gratitude to [Shengding Hu](https://shengdinghu.github.io/) for his valuable advice on the Phase 2 data mixture. We also thank [Exabits](https://www.exabits.ai/) for their assistance in setting up the GPU clusters, and [Lepton AI](https://www.lepton.ai/) for their support in setting up the chat demo.