Update README.md
README.md
CHANGED
@@ -122,6 +122,34 @@ For deployment, we recommend using vLLM. You can enable the long-context capabil
**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
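For reference, here is a minimal sketch of what adding that configuration to the model's `config.json` could look like. It assumes the standard YaRN `rope_scaling` fields (`type`, `factor`, `original_max_position_embeddings`); the file path and the numeric values are illustrative placeholders rather than settings taken from this README, so follow the deployment instructions above for the exact snippet.

```python
# Hedged sketch: patch config.json to enable static YaRN before launching vLLM.
# The path and the numeric values below are placeholders -- use the snippet from
# the deployment section above for the real settings.
import json

config_path = "Qwen2-57B-A14B-Instruct/config.json"  # assumed local path to the downloaded weights

with open(config_path) as f:
    config = json.load(f)

# Standard YaRN rope-scaling fields; the factor and original length are illustrative only.
config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

Since static YaRN applies the same scaling factor to every request, skip (or revert) this change when serving mostly short inputs.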

## Evaluation

We briefly compare Qwen2-57B-A14B-Instruct with similar-sized instruction-tuned LLMs, including Qwen1.5-32B-Chat. The results are shown as follows:

| Datasets | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat | **Qwen2-57B-A14B-Instruct** |
| :--- | :---: | :---: | :---: | :---: |
| Architecture | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 34B | 32B | 14B |
| #Params | 47B | 34B | 32B | 57B |
| _**English**_ | | | | |
| MMLU | 71.4 | **76.8** | 74.8 | 75.4 |
| MMLU-Pro | 43.3 | 52.3 | 46.4 | **52.8** |
| GPQA | - | - | 30.8 | **34.3** |
| TheoremQA | - | - | 30.9 | **33.1** |
| MT-Bench | 8.30 | 8.50 | 8.30 | **8.55** |
| _**Coding**_ | | | | |
| HumanEval | 45.1 | 75.2 | 68.3 | **79.9** |
| MBPP | 59.5 | **74.6** | 67.9 | 70.9 |
| MultiPL-E | - | - | 50.7 | **66.4** |
| EvalPlus | 48.5 | - | 63.6 | **71.6** |
| LiveCodeBench | 12.3 | - | 15.2 | **25.5** |
| _**Mathematics**_ | | | | |
| GSM8K | 65.7 | **90.2** | 83.6 | 79.6 |
| MATH | 30.7 | **50.1** | 42.4 | 49.1 |
| _**Chinese**_ | | | | |
| C-Eval | - | - | 76.7 | 80.5 |
| AlignBench | 5.70 | 7.20 | 7.19 | **7.36** |

## Citation

If you find our work helpful, feel free to give us a cite.