nxphi47 committed
Commit
9ee132d
1 Parent(s): bf677b2

Update README.md

Files changed (1)
  1. README.md +20 -24
README.md CHANGED
@@ -75,6 +75,22 @@ By using our released weights, codes, and demos, you agree to and comply with th
 
 ## Evaluation
 
+
+ ### Multilingual World Knowledge
+
+
+ We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for En, 3-shot [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) (M3e) for En, Zh, Vi, Id, Th, and zero-shot [VMLU](https://vmlu.ai/) for Vi.
+
+ | Model | Langs | En<br>MMLU | En<br>M3e | Zh<br>M3e | Vi<br>M3e | Vi<br>VMLU | Id<br>M3e | Th<br>M3e
+ |-----| ----- | --- | -- | ----- | ---- | --- | --- | --- |
+ | GPT-3.5 | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 46.32 | 49.27 | 37.41
+ | Vistral-7B-chat | Mono | 56.86 | 67.00 | 44.56 | 54.33 | 50.03 | 36.49 | 25.27
+ | Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 45.02 | 24.29 | 20.25
+ | SailorLM | Multi | 52.72 | 59.76 | 67.74 | 50.14 | --- | 39.53 | 37.73
+ | SeaLLM-7B-v2 | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 45.74 | 42.25 | 35.52
+ | SeaLLM-7B-v2.5 | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 53.30 | 48.64 | 46.86
+
+
 ### Zero-shot CoT Multilingual Math Reasoning
 
 <!--
@@ -83,21 +99,16 @@ By using our released weights, codes, and demos, you agree to and comply with th
 ![fig_sea_math_side_by_side.png](fig_sea_math_side_by_side.png)
 -->
 
-
- <details>
- <summary>See details on English and translated GSM8K and MATH with zero-shot reasoning</summary>
- <br>
-
 | Model | GSM8K<br>en | MATH<br>en | GSM8K<br>zh | MATH<br>zh | GSM8K<br>vi | MATH<br>vi | GSM8K<br>id | MATH<br>id | GSM8K<br>th | MATH<br>th
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | GPT-3.5 | 80.8 | 34.1 | 48.2 | 21.5 | 55 | 26.5 | 64.3 | 26.4 | 35.8 | 18.1
- | Qwen-14B-chat | 61.4 | 18.4 | 41.6 | 11.8 | 33.6 | 3.6 | 44.7 | 8.6 | 22 | 6
+ | Qwen-14B-chat | 61.4 | 18.4 | 41.6 | 11.8 | 33.6 | 3.6 | 44.7 | 8.6 | 22 | 6.0
 | Vistral-7b-chat | 48.2 | 12.5 | | | 48.7 | 3.1 | | | |
- | Qwen1.5-7B-chat | 56.8 | 15.3 | 40 | 2.7 | 37.7 | 9 | 36.9 | 7.7 | 21.9 |
+ | Qwen1.5-7B-chat | 56.8 | 15.3 | 40.0 | 2.7 | 37.7 | 9 | 36.9 | 7.7 | 21.9 | 4.7
 | SeaLLM-7B-v2 | 78.2 | 27.5 | 53.7 | 17.6 | 69.9 | 23.8 | 71.5 | 24.4 | 59.6 | 22.4
 | SeaLLM-7B-v2.5 | 78.5 | 34.9 | 51.3 | 22.1 | 72.3 | 30.2 | 71.5 | 30.1 | 62.0 | 28.4
 
- </details>
+
 
 Baselines were evaluated using their respective chat-template and system prompts ([Qwen1.5-7B-chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat/blob/main/tokenizer_config.json), [Vistral](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat)).
 
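For readers who want to see how zero-shot CoT scores like the GSM8K/MATH numbers above are typically produced, here is a minimal sketch built on the Hugging Face chat-template API, mirroring the note that every baseline is run through its own chat template and system prompt. The repo id, prompt wording, generation settings, and last-number answer extraction are illustrative assumptions, not the exact evaluation harness behind the table.

```python
# Minimal zero-shot CoT scoring sketch (illustrative; not the official harness).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SeaLLMs/SeaLLM-7B-v2.5"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def zero_shot_cot_answer(question: str) -> str:
    # Zero-shot CoT: no exemplars, only an instruction to reason step by step,
    # wrapped in the model's own chat template (as done for every baseline).
    messages = [{
        "role": "user",
        "content": f"{question}\nPlease reason step by step, then state the final answer.",
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
    # Naive extraction: treat the last number in the completion as the prediction.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else ""

# GSM8K-style exact-match check on one toy item.
print(zero_shot_cot_answer("Mary has 3 boxes of 12 apples. How many apples in total?") == "36")
```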
@@ -113,25 +124,10 @@ Baselines were evaluated using their respective chat-template and system prompts
 | SeaLLM-7B-v2.5 | 58.0 | **64.8**
 
 
- ### Multilingual World Knowledge
-
-
- We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for En, 3-shot [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) (M3e) for En, Zh, Vi, Id, Th, and zero-shot [VMLU](https://vmlu.ai/) for Vi.
-
- | Model | Langs | En<br>MMLU | En<br>M3e | Zh<br>M3e | Vi<br>M3e | Vi<br>VMLU | Id<br>M3e | Th<br>M3e
- |-----| ----- | --- | -- | ----- | ---- | --- | --- | --- |
- | GPT-3.5 | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 46.32 | 49.27 | 37.41
- | Vistral-7B-chat | Mono | 56.86 | 67.00 | 44.56 | 54.33 | 50.03 | 36.49 | 25.27
- | Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 45.02 | 24.29 | 20.25
- | SailorLM | Multi | 52.72 | 59.76 | 67.74 | 50.14 | --- | 39.53 | 37.73
- | SeaLLM-7B-v2 | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 45.74 | 42.25 | 35.52
- | SeaLLM-7B-v2.5 | Multi | 64.05 | 76.87 | 62.54 | 63.11 | 53.30 | 48.64 | 46.86
-
 
 ### Sea-Bench
 
- Not ready
-
+ ![fig_sea_bench_side_by_side.png](fig_sea_bench_side_by_side.png)
 
 
 ### Usage
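On the multilingual world-knowledge setup reported above (5-shot MMLU, 3-shot M3Exam, zero-shot VMLU): these are multiple-choice benchmarks, and a common way to score them is to build a k-shot prompt from held-out dev examples and pick the answer letter with the highest next-token logit. The sketch below illustrates that pattern only; the repo id, prompt template, and scoring rule are assumptions and may differ from the recommended default setups the README refers to.

```python
# Generic k-shot multiple-choice scoring sketch (illustrative; not the official setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SeaLLMs/SeaLLM-7B-v2.5"  # assumed repo id
CHOICES = ["A", "B", "C", "D"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def format_example(question, options, answer=None):
    lines = [question] + [f"{c}. {o}" for c, o in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

@torch.no_grad()
def predict(dev_shots, question, options):
    # k-shot prompt: dev examples with gold answers, then the unanswered test item.
    prompt = "\n\n".join(
        [format_example(q, o, a) for q, o, a in dev_shots]
        + [format_example(question, options)]
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    next_token_logits = model(input_ids).logits[0, -1]
    # Compare the logits of the four answer letters and return the best one.
    letter_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[-1] for c in CHOICES]
    return CHOICES[int(torch.argmax(next_token_logits[letter_ids]))]
```

Accuracy is then the fraction of test items whose predicted letter matches the gold answer.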