hyxmmm committed
Commit a1ce063
1 Parent(s): 2842b0f

Update README.md

Files changed (1)
  1. README.md +14 -260
README.md CHANGED
@@ -18,6 +18,7 @@ language:
 Infinity-Instruct-3M-0613-Mistral-7B is an open-source model tuned with supervised instruction fine-tuning alone, without reinforcement learning from human feedback (RLHF). It is fine-tuned only on [Infinity-Instruct-3M and Infinity-Instruct-0613](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and shows favorable results on AlpacaEval 2.0 compared to Mixtral 8x7B v0.1, Gemini Pro, and GPT-3.5.
 
 ## **Training Details**
+
 <p align="center">
 <img src="fig/trainingflow.png">
 </p>
@@ -41,64 +42,16 @@ Thanks to [FlagScale](https://github.com/FlagOpen/FlagScale), we could concatena
 
 ## **Benchmark**
 
- <style type="text/css">
- .tg {border-collapse:collapse;border-spacing:0;}
- .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
- overflow:hidden;padding:10px 5px;word-break:normal;}
- .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
- font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
- .tg .tg-baqh{text-align:center;vertical-align:top}
- .tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
- .tg .tg-0lax{text-align:left;vertical-align:top}
- </style>
- <table class="tg"><thead>
- <tr>
- <th class="tg-amwm">Model</th>
- <th class="tg-amwm">MT-Bench</th>
- <th class="tg-amwm">AlpacaEval2.0</th>
- </tr></thead>
- <tbody>
- <tr>
- <td class="tg-0lax">OpenHermes-2.5-Mistral-7B*</td>
- <td class="tg-baqh">7.5</td>
- <td class="tg-baqh">16.2</td>
- </tr>
- <tr>
- <td class="tg-0lax">Mistral-7B-Instruct-v0.2</td>
- <td class="tg-baqh">7.6</td>
- <td class="tg-baqh">17.1</td>
- </tr>
- <tr>
- <td class="tg-0lax">Llama-3-8B-Instruct</td>
- <td class="tg-baqh">8.1</td>
- <td class="tg-baqh">22.9</td>
- </tr>
- <tr>
- <td class="tg-0lax">GPT 3.5 Turbo 0613</td>
- <td class="tg-baqh">8.4</td>
- <td class="tg-baqh">22.7</td>
- </tr>
- <tr>
- <td class="tg-0lax">Mixtral 8x7B v0.1</td>
- <td class="tg-baqh">8.3</td>
- <td class="tg-baqh">23.7</td>
- </tr>
- <tr>
- <td class="tg-0lax">Gemini Pro</td>
- <td class="tg-baqh">--</td>
- <td class="tg-baqh">24.4</td>
- </tr>
- <tr>
- <td class="tg-0lax">InfInstruct-3M-Mistral-7B*</td>
- <td class="tg-baqh">7.6</td>
- <td class="tg-baqh">16.2</td>
- </tr>
- <tr>
- <td class="tg-0lax">InfInstruct-3M-0613-Mistral-7B*</td>
- <td class="tg-baqh">8.1</td>
- <td class="tg-amwm">25.5</td>
- </tr>
- </tbody></table>
+ | **Model** | **MT-Bench** | **AlpacaEval2.0** |
+ |:-------------------------------:|:------------:|:-----------------:|
+ | OpenHermes-2.5-Mistral-7B* | 7.5 | 16.2 |
+ | Mistral-7B-Instruct-v0.2 | 7.6 | 17.1 |
+ | Llama-3-8B-Instruct | 8.1 | 22.9 |
+ | GPT 3.5 Turbo 0613 | 8.4 | 22.7 |
+ | Mixtral 8x7B v0.1 | 8.3 | 23.7 |
+ | Gemini Pro | -- | 24.4 |
+ | InfInstruct-3M-Mistral-7B* | 7.6 | 16.2 |
+ | InfInstruct-3M-0613-Mistral-7B* | 8.1 | **25.5** |
 
 *denotes models fine-tuned without reinforcement learning from human feedback (RLHF).
 
@@ -108,208 +61,9 @@ We evaluate Infinity-Instruct-3M-0613-Mistral-7B on the two most popular instruc
 
 We also evaluate Infinity-Instruct-3M-0613-Mistral-7B on diverse objective downstream tasks with [Opencompass](https://opencompass.org.cn):
 
- <style type="text/css">
- .tg {border-collapse:collapse;border-spacing:0;}
- .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
- overflow:hidden;padding:10px 5px;word-break:normal;}
- .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
- font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
- .tg .tg-baqh{text-align:center;vertical-align:top}
- .tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
- .tg .tg-nrix{text-align:center;vertical-align:middle}
- </style>
- <table class="tg"><thead>
- <tr>
- <th class="tg-amwm" colspan="2">Benchmark</th>
- <th class="tg-amwm">Infinity-Instruct-3M-Mistral-7B</th>
- <th class="tg-amwm">Infinity-Instruct-3M-0613-Mistral-7B</th>
- <th class="tg-amwm">Mistral-7B-v0.1</th>
- <th class="tg-amwm">mistral-7B instruction v0.2</th>
- <th class="tg-amwm">teknium/OpenHermes-2.5-Mistral-7B</th>
- </tr></thead>
- <tbody>
- <tr>
- <td class="tg-nrix" rowspan="7">GPT4ALL</td>
- <td class="tg-baqh">ARC-c</td>
- <td class="tg-amwm">82.37</td>
- <td class="tg-baqh">83.30</td>
- <td class="tg-baqh">69.15</td>
- <td class="tg-baqh">73.22</td>
- <td class="tg-baqh">78.31</td>
- </tr>
- <tr>
- <td class="tg-baqh">ARC-e</td>
- <td class="tg-amwm">92.42</td>
- <td class="tg-baqh">90.65</td>
- <td class="tg-baqh">79.54</td>
- <td class="tg-baqh">82.01</td>
- <td class="tg-baqh">88.54</td>
- </tr>
- <tr>
- <td class="tg-baqh">Hellaswag</td>
- <td class="tg-amwm">84.82</td>
- <td class="tg-baqh">76.88</td>
- <td class="tg-baqh">35.50</td>
- <td class="tg-baqh">64.40</td>
- <td class="tg-baqh">80.53</td>
- </tr>
- <tr>
- <td class="tg-baqh">Winogrande</td>
- <td class="tg-baqh">61.75</td>
- <td class="tg-baqh">52.63</td>
- <td class="tg-baqh">54.04</td>
- <td class="tg-baqh">57.89</td>
- <td class="tg-amwm">62.11</td>
- </tr>
- <tr>
- <td class="tg-baqh">BoolQ</td>
- <td class="tg-amwm">87.85</td>
- <td class="tg-baqh">86.45</td>
- <td class="tg-baqh">50.09</td>
- <td class="tg-baqh">55.75</td>
- <td class="tg-baqh">87.34</td>
- </tr>
- <tr>
- <td class="tg-baqh">PIQA</td>
- <td class="tg-amwm">87.11</td>
- <td class="tg-baqh">86.13</td>
- <td class="tg-baqh">60.39</td>
- <td class="tg-baqh">72.36</td>
- <td class="tg-baqh">80.14</td>
- </tr>
- <tr>
- <td class="tg-baqh">OBQA</td>
- <td class="tg-amwm">83.00</td>
- <td class="tg-baqh">79.40</td>
- <td class="tg-baqh">62.60</td>
- <td class="tg-baqh">68.00</td>
- <td class="tg-baqh">81.00</td>
- </tr>
- <tr>
- <td class="tg-nrix" rowspan="4">Commonsense QA</td>
- <td class="tg-baqh">MMLU</td>
- <td class="tg-baqh">62.85</td>
- <td class="tg-amwm">63.62</td>
- <td class="tg-baqh">56.49</td>
- <td class="tg-baqh">59.56</td>
- <td class="tg-baqh">63.16</td>
- </tr>
- <tr>
- <td class="tg-baqh">NQ</td>
- <td class="tg-baqh">24.46</td>
- <td class="tg-baqh">27.48</td>
- <td class="tg-baqh">13.99</td>
- <td class="tg-baqh">18.42</td>
- <td class="tg-amwm">28.84</td>
- </tr>
- <tr>
- <td class="tg-baqh">TriviaQA</td>
- <td class="tg-baqh">60.85</td>
- <td class="tg-amwm">64.06</td>
- <td class="tg-baqh">63.99</td>
- <td class="tg-baqh">59.21</td>
- <td class="tg-baqh">63.72</td>
- </tr>
- <tr>
- <td class="tg-baqh">GPQA</td>
- <td class="tg-baqh">27.27</td>
- <td class="tg-amwm">27.78</td>
- <td class="tg-baqh">23.23</td>
- <td class="tg-baqh">19.19</td>
- <td class="tg-baqh">26.77</td>
- </tr>
- <tr>
- <td class="tg-nrix" rowspan="4">MATH &amp; Reasoning</td>
- <td class="tg-baqh">GSM8K</td>
- <td class="tg-baqh">78.09</td>
- <td class="tg-amwm">79.83</td>
- <td class="tg-baqh">48.07</td>
- <td class="tg-baqh">45.94</td>
- <td class="tg-baqh">73.62</td>
- </tr>
- <tr>
- <td class="tg-baqh">Math</td>
- <td class="tg-amwm">28.38</td>
- <td class="tg-baqh">23.30</td>
- <td class="tg-baqh">11.76</td>
- <td class="tg-baqh">9.46</td>
- <td class="tg-baqh">17.32</td>
- </tr>
- <tr>
- <td class="tg-baqh">BBH</td>
- <td class="tg-baqh">59.61</td>
- <td class="tg-amwm">61.07</td>
- <td class="tg-baqh">56.65</td>
- <td class="tg-baqh">49.15</td>
- <td class="tg-baqh">60.41</td>
- </tr>
- <tr>
- <td class="tg-baqh">DROP</td>
- <td class="tg-amwm">68.17</td>
- <td class="tg-baqh">65.62</td>
- <td class="tg-baqh">3.06</td>
- <td class="tg-baqh">6.98</td>
- <td class="tg-baqh">64.49</td>
- </tr>
- <tr>
- <td class="tg-nrix" rowspan="2">Code</td>
- <td class="tg-baqh">HumanEval</td>
- <td class="tg-baqh">50.61</td>
- <td class="tg-amwm">51.22</td>
- <td class="tg-baqh">14.02</td>
- <td class="tg-baqh">32.93</td>
- <td class="tg-baqh">43.29</td>
- </tr>
- <tr>
- <td class="tg-baqh">MBPP</td>
- <td class="tg-amwm">46.00</td>
- <td class="tg-baqh">44.80</td>
- <td class="tg-baqh">38.00</td>
- <td class="tg-baqh">3.80</td>
- <td class="tg-baqh">41.80</td>
- </tr>
- <tr>
- <td class="tg-nrix" rowspan="4">Chinese</td>
- <td class="tg-baqh">AGI Eval</td>
- <td class="tg-amwm">42.24</td>
- <td class="tg-baqh">40.43</td>
- <td class="tg-baqh">27.92</td>
- <td class="tg-baqh">35.78</td>
- <td class="tg-baqh">36.32</td>
- </tr>
- <tr>
- <td class="tg-baqh">c-eval</td>
- <td class="tg-baqh">48.62</td>
- <td class="tg-amwm">49.00</td>
- <td class="tg-baqh">46.83</td>
- <td class="tg-baqh">42.58</td>
- <td class="tg-baqh">44.30</td>
- </tr>
- <tr>
- <td class="tg-baqh">cmmlu</td>
- <td class="tg-baqh">46.67</td>
- <td class="tg-amwm">48.07</td>
- <td class="tg-baqh">34.59</td>
- <td class="tg-baqh">42.05</td>
- <td class="tg-baqh">43.05</td>
- </tr>
- <tr>
- <td class="tg-baqh">gaokao</td>
- <td class="tg-baqh">12.54</td>
- <td class="tg-baqh">14.48</td>
- <td class="tg-baqh">13.24</td>
- <td class="tg-baqh">12.10</td>
- <td class="tg-amwm">15.00</td>
- </tr>
- <tr>
- <td class="tg-amwm" colspan="2">AVERAGE</td>
- <td class="tg-amwm">58.84</td>
- <td class="tg-baqh">57.91</td>
- <td class="tg-baqh">41.10</td>
- <td class="tg-baqh">43.37</td>
- <td class="tg-baqh">56.19</td>
- </tr>
- </tbody></table>
+ <p align="center">
+ <img src="fig/result.png">
+ </p>
 
 ## **How to use**
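
The model can be queried with the standard `transformers` API. Below is a minimal inference sketch: the repo id `BAAI/Infinity-Instruct-3M-0613-Mistral-7B` and the availability of a chat template in the checkpoint are assumptions based on this model card, and the generation settings are illustrative rather than recommended values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id for this model card.
model_id = "BAAI/Infinity-Instruct-3M-0613-Mistral-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)

# Format a single-turn conversation with the tokenizer's chat template
# (assumed to ship with the checkpoint, as for Mistral-style chat models).
messages = [{"role": "user", "content": "Explain instruction tuning in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; print only the newly generated tokens.
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```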
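
For the OpenCompass results above, a model entry along the following lines could be added to an evaluation config. This is a sketch assuming OpenCompass's `HuggingFaceCausalLM` wrapper; the sequence length, batch size, and GPU settings are illustrative assumptions, not the settings behind the reported numbers.

```python
# Sketch of an OpenCompass model config. A file like this (e.g. a hypothetical
# configs/eval_infinity_mistral.py) would be run with `python run.py <config>`.
# All field values below are illustrative assumptions.
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr="infinity-instruct-3m-0613-mistral-7b",
        path="BAAI/Infinity-Instruct-3M-0613-Mistral-7B",  # assumed repo id
        tokenizer_path="BAAI/Infinity-Instruct-3M-0613-Mistral-7B",
        max_seq_len=2048,
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```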