ChloeAuYeung committed on
Commit 3f1c9ac
1 Parent(s): 6dbe66b
Update README.md
README.md CHANGED
@@ -147,19 +147,19 @@ For the Code data, the following table shows the proportion of different program
 
 To comprehensively evaluate the model's performance, we ran a full set of tests on a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations cover the model's capabilities across several areas, specifically Chinese question answering, English question answering, language understanding, commonsense question answering, logical reasoning, mathematical problem solving, and coding ability. The evaluation results are as follows:
 
-| Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
-| :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
-| Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
-| | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
-| | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
-| English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
-| | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
-| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
-| Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
-| Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
-| Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
-| Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
-| Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+| Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+| :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+| Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+| | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+| | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+| English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+| | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+| Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+| Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+| Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+| Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+| Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
 
 > <sup>1: Tests cover only the single-answer multiple-choice questions, i.e. fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions are excluded.</sup>
 
@@ -170,19 +170,19 @@ For the Code data, the following table shows the proportion of different program
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations spanned multiple capabilities of the model, specifically including Chinese question answering, English question answering, language comprehension, common sense questioning, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:
 
-| Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
-| :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
-| Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
-| | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
-| | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
-| English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
-| | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
-| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
-| Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
-| Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
-| Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
-| Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
-| Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+| Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+| :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+| Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+| | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+| | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+| English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+| | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+| Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+| Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+| Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+| Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+| Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>
 
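As a rough reading aid for the new XVERSE-65B-2 column, the sketch below computes the per-benchmark difference between XVERSE-65B-2 and XVERSE-65B, using only the scores that appear in the updated table. The names `scores` and `score_deltas` are illustrative assumptions and are not part of the repository.

```python
# Scores copied from the XVERSE-65B-2 and XVERSE-65B columns of the updated table.
# Each entry maps benchmark -> (XVERSE-65B-2 score, XVERSE-65B score).
scores = {
    "C-Eval": (72.4, 68.6),
    "CMMLU": (75.1, 72.6),
    "Gaokao-Bench": (76.9, 73.9),
    "MMLU": (74.4, 70.8),
    "GAOKAO-English": (86.6, 85.3),
    "AGIEval": (66.2, 61.8),
    "RACE-M": (90.7, 90.6),
    "CommonSenseQA": (81.1, 79.8),
    "PIQA": (79.4, 80.4),
    "GSM8K": (72.6, 60.3),
    "HumanEval": (37.8, 26.8),
}


def score_deltas(table):
    """Return benchmark -> (new - old), rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (new, old) in table.items()}


deltas = score_deltas(scores)

# Benchmarks where the newer column is lower than the older one.
regressions = [name for name, delta in deltas.items() if delta < 0]

# Print the differences from largest gain to largest drop.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {delta:+.1f}")
```

On these numbers the largest differences are on GSM8K and HumanEval, and PIQA is the only benchmark where the XVERSE-65B-2 score is lower than the XVERSE-65B score.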