Update README.md


Files changed (1)

README.md +21 -0


	@@ -119,3 +119,24 @@ model = model.quantize(4).cuda()
119
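120	> Note: CMMLU is a comprehensive Chinese evaluation benchmark designed specifically to assess the knowledge and reasoning abilities of language models in a Chinese-language context. We evaluated the models directly with its official [evaluation script](https://github.com/haonan-li/CMMLU). In the Model zero-shot table, the score for [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-13B) comes from our own run of the official CMMLU evaluation script; the scores for the other models are taken from the official [CMMLU](https://github.com/haonan-li/CMMLU/tree/master) evaluation results.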
121

+### English Leaderboard
+In addition to Chinese, we also tested the model's performance in English.
+#### MMLU
+[MMLU](https://arxiv.org/abs/2009.03300) is an English evaluation dataset of 57 multiple-choice tasks covering elementary mathematics, American history, computer science, law, and more. Its difficulty ranges from high school level to expert level, and it is a mainstream LLM evaluation dataset.
+We adopted the [open-source](https://github.com/hendrycks/test) evaluation scheme, and the final 5-shot results are as follows:
+| Model                                  | Humanities | Social Sciences | STEM | Other | Average |
+|----------------------------------------|-----------:|:---------------:|:----:|:-----:|:-------:|
+| LLaMA-7B<sup>2</sup>                   |       34.0 |      38.3       | 30.5 | 38.1  |  35.1   |
+| Falcon-7B<sup>1</sup>                  |          - |        -        |  -   |   -   |  35.0   |
+| mpt-7B<sup>1</sup>                     |          - |        -        |  -   |   -   |  35.6   |
+| ChatGLM-6B<sup>0</sup>                 |       35.4 |      41.0       | 31.3 | 40.5  |  36.9   |
+| BLOOM 7B<sup>0</sup>                   |       25.0 |      24.4       | 26.5 | 26.4  |  25.5   |
+| BLOOMZ 7B<sup>0</sup>                  |       31.3 |      42.1       | 34.4 | 39.0  |  36.1   |
+| moss-moon-003-base (16B)<sup>0</sup>   |       24.2 |      22.8       | 22.4 | 24.4  |  23.6   |
+| moss-moon-003-sft (16B)<sup>0</sup>    |       30.5 |      33.8       | 29.3 | 34.4  |  31.9   |
+| Baichuan-7B<sup>0</sup>                |       38.4 |      48.9       | 35.6 | 48.1  |  42.3   |
+| **Baichuan-7B<sup>0</sup>**            |       **38.9** |      **49.0**       | **35.3** | **48.8**  |  **42.6**   |
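
For readers unfamiliar with the term, "5-shot" means each test question is preceded by five solved examples from the same subject. The sketch below shows one way such a prompt could be assembled. It is loosely modeled on the layout used by the hendrycks/test repository, but the header wording, the helper names `format_example` and `build_few_shot_prompt`, and the tuple layout of the examples are assumptions made for illustration, not code taken from this README or from the evaluation script.

```python
# Minimal sketch of k-shot prompt construction for a multiple-choice benchmark.
# All names here are hypothetical and chosen for illustration only.

CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one question; include the answer letter only for solved shots."""
    lines = [question]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_few_shot_prompt(subject, dev_examples, test_question, test_options, k=5):
    """Prepend up to k solved examples to an unanswered test question.

    dev_examples is a list of (question, options, answer_letter) tuples.
    """
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    shots = [format_example(q, opts, ans) for q, opts, ans in dev_examples[:k]]
    query = format_example(test_question, test_options)
    return header + "\n\n".join(shots + [query])
```

The model's predicted letter is then typically read from whichever of `A` to `D` it ranks highest after the final `Answer:`, and accuracy is the share of questions where that letter matches the reference.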