alexmarques committed (verified)
Commit d292013 · Parent(s): f2c944e

Update README.md

Files changed (1): README.md (+179, -22)
README.md CHANGED
@@ -32,8 +32,9 @@ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
32
  - **License(s):** Llama3.1
33
  - **Model Developers:** Neural Magic
34
 
35
- Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
36
- It achieves scores within 1% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
 
37
 
38
  ### Model Optimizations
39
 
@@ -131,9 +132,19 @@ model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
131
 
132
  ## Evaluation
133
 
134
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
135
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
136
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
137
 
138
  **Note:** Results have been updated after Meta modified the chat template.
139
 
@@ -151,12 +162,26 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
151
  <td><strong>Recovery</strong>
152
  </td>
153
  </tr>
154
  <tr>
155
  <td>MMLU (5-shot)
156
  </td>
157
- <td>68.32
158
  </td>
159
- <td>67.83
160
  </td>
161
  <td>99.3%
162
  </td>
@@ -164,9 +189,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
164
  <tr>
165
  <td>MMLU (CoT, 0-shot)
166
  </td>
167
- <td>72.83
168
  </td>
169
- <td>72.18
170
  </td>
171
  <td>99.1%
172
  </td>
@@ -174,9 +199,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
174
  <tr>
175
  <td>ARC Challenge (0-shot)
176
  </td>
177
- <td>81.40
178
  </td>
179
- <td>81.66
180
  </td>
181
  <td>100.3%
182
  </td>
@@ -184,9 +209,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
184
  <tr>
185
  <td>GSM-8K (CoT, 8-shot, strict-match)
186
  </td>
187
- <td>82.79
188
  </td>
189
- <td>84.84
190
  </td>
191
  <td>102.5%
192
  </td>
@@ -194,9 +219,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
194
  <tr>
195
  <td>Hellaswag (10-shot)
196
  </td>
197
- <td>80.47
198
  </td>
199
- <td>80.28
200
  </td>
201
  <td>99.8%
202
  </td>
@@ -204,9 +229,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
204
  <tr>
205
  <td>Winogrande (5-shot)
206
  </td>
207
- <td>78.06
208
  </td>
209
- <td>78.45
210
  </td>
211
  <td>100.5%
212
  </td>
@@ -214,9 +239,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
214
  <tr>
215
  <td>TruthfulQA (0-shot, mc2)
216
  </td>
217
- <td>54.48
218
  </td>
219
- <td>54.65
220
  </td>
221
  <td>100.3%
222
  </td>
@@ -224,13 +249,111 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
224
  <tr>
225
  <td><strong>Average</strong>
226
  </td>
227
- <td><strong>74.05</strong>
228
  </td>
229
- <td><strong>74.27</strong>
230
  </td>
231
  <td><strong>100.3%</strong>
232
  </td>
233
  </tr>
234
  </table>
235
 
236
  ### Reproduction
@@ -311,4 +434,38 @@ lm_eval \
311
  --tasks truthfulqa \
312
  --num_fewshot 0 \
313
  --batch_size auto
314
- ```
32
  - **License(s):** Llama3.1
33
  - **Model Developers:** Neural Magic
34
 
35
+ This model is a quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
36
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
37
+ Meta-Llama-3.1-8B-Instruct-quantized.w8a8 achieves 105.4% recovery for the Arena-Hard evaluation, 100.3% for OpenLLM v1 (using Meta's prompting when available), 101.5% for OpenLLM v2, 99.7% for HumanEval pass@1, and 98.8% for HumanEval+ pass@1.
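+
+ The recovery figures quoted here, and in the table further below, are simply the quantized model's score divided by the unquantized baseline's score. A minimal Python sketch of that arithmetic (27.2 and 25.8 are the Arena-Hard averages reported in the Evaluation section):
+ ```
+ # Recovery = quantized score / unquantized baseline score, as a percentage.
+ def recovery(quantized_score: float, baseline_score: float) -> float:
+     return 100.0 * quantized_score / baseline_score
+
+ # Example with the Arena-Hard averages from the Evaluation section.
+ print(f"{recovery(27.2, 25.8):.1f}%")  # ~105.4%
+ ```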
38
 
39
  ### Model Optimizations
40
 
 
132
 
133
  ## Evaluation
134
 
135
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
136
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
137
+
138
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
139
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
140
+ We report below the scores obtained in each judgement and the average.
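+
+ As an illustration of the generation step only (not the Arena-Hard-Auto pipeline itself, which drives answer generation and GPT-4 judging through its own scripts), the Python sketch below produces one greedy answer per prompt with vLLM; the prompt list and sampling settings are placeholders:
+ ```
+ # Illustrative sketch: one greedy answer per prompt with vLLM.
+ # The prompt list and sampling settings are placeholders, not the
+ # Arena-Hard-Auto defaults.
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ llm = LLM(model=model_id, max_model_len=4096)
+
+ prompts = ["Explain INT8 weight and activation quantization."]  # placeholder
+ chat_prompts = [
+     tokenizer.apply_chat_template(
+         [{"role": "user", "content": p}], tokenize=False, add_generation_prompt=True
+     )
+     for p in prompts
+ ]
+ outputs = llm.generate(chat_prompts, SamplingParams(temperature=0.0, max_tokens=1024))
+ for output in outputs:
+     print(output.outputs[0].text)
+ ```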
141
+
142
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
143
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
144
+
145
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
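+
+ For reference, pass@1 follows the standard unbiased pass@k estimator from the HumanEval paper, computed over the 50 samples generated per problem (see the reproduction commands below); a small Python sketch, assuming EvalPlus's default pass@k accounting:
+ ```
+ # Unbiased pass@k estimator (Chen et al., 2021). EvalPlus computes this
+ # internally; shown here only to make the reported pass@1 numbers concrete.
+ from math import comb
+
+ def pass_at_k(n: int, c: int, k: int) -> float:
+     """Probability that at least one of k samples drawn from the n generations
+     is correct, given that c of the n generations are correct."""
+     if n - c < k:
+         return 1.0
+     return 1.0 - comb(n - c, k) / comb(n, k)
+
+ # Example: 50 samples for one problem, 30 of which pass the unit tests.
+ print(pass_at_k(n=50, c=30, k=1))  # 0.6
+ ```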
146
+
147
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
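+
+ These output datasets can be browsed with the standard `datasets` library; a minimal Python sketch (the split name and record fields are assumptions — check the individual dataset cards):
+ ```
+ # Minimal sketch for browsing the published evaluation outputs.
+ # The split name and record fields are assumptions; see the dataset card.
+ from datasets import load_dataset
+
+ ds = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals", split="train")
+ print(ds)     # schema and number of records
+ print(ds[0])  # first record
+ ```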
148
 
149
  **Note:** Results have been updated after Meta modified the chat template.
150
 
 
162
  <td><strong>Recovery</strong>
163
  </td>
164
  </tr>
165
+ <tr>
166
+ <td><strong>Arena Hard</strong>
167
+ </td>
168
+ <td>25.8 (25.1 / 26.5)
169
+ </td>
170
+ <td>27.2 (27.6 / 26.7)
171
+ </td>
172
+ <td>105.4%
173
+ </td>
174
+ </tr>
175
+ <tr>
176
+ <td><strong>OpenLLM v1</strong>
177
+ </td>
178
+ </tr>
179
  <tr>
180
  <td>MMLU (5-shot)
181
  </td>
182
+ <td>68.3
183
  </td>
184
+ <td>67.8
185
  </td>
186
  <td>99.3%
187
  </td>
 
189
  <tr>
190
  <td>MMLU (CoT, 0-shot)
191
  </td>
192
+ <td>72.8
193
  </td>
194
+ <td>72.2
195
  </td>
196
  <td>99.1%
197
  </td>
 
199
  <tr>
200
  <td>ARC Challenge (0-shot)
201
  </td>
202
+ <td>81.4
203
  </td>
204
+ <td>81.7
205
  </td>
206
  <td>100.3%
207
  </td>
 
209
  <tr>
210
  <td>GSM-8K (CoT, 8-shot, strict-match)
211
  </td>
212
+ <td>82.8
213
  </td>
214
+ <td>84.8
215
  </td>
216
  <td>102.5%
217
  </td>
 
219
  <tr>
220
  <td>Hellaswag (10-shot)
221
  </td>
222
+ <td>80.5
223
  </td>
224
+ <td>80.3
225
  </td>
226
  <td>99.8%
227
  </td>
 
229
  <tr>
230
  <td>Winogrande (5-shot)
231
  </td>
232
+ <td>78.1
233
  </td>
234
+ <td>78.5
235
  </td>
236
  <td>100.5%
237
  </td>
 
239
  <tr>
240
  <td>TruthfulQA (0-shot, mc2)
241
  </td>
242
+ <td>54.5
243
  </td>
244
+ <td>54.7
245
  </td>
246
  <td>100.3%
247
  </td>
 
249
  <tr>
250
  <td><strong>Average</strong>
251
  </td>
252
+ <td><strong>74.1</strong>
253
  </td>
254
+ <td><strong>74.3</strong>
255
  </td>
256
  <td><strong>100.3%</strong>
257
  </td>
258
  </tr>
259
+ <tr>
260
+ <td><strong>OpenLLM v2</strong>
261
+ </td>
262
+ </tr>
263
+ <tr>
264
+ <td>MMLU-Pro (5-shot)
265
+ </td>
266
+ <td>30.8
267
+ </td>
268
+ <td>30.9
269
+ </td>
270
+ <td>100.3%
271
+ </td>
272
+ </tr>
273
+ <tr>
274
+ <td>IFEval (0-shot)
275
+ </td>
276
+ <td>77.9
277
+ </td>
278
+ <td>78.0
279
+ </td>
280
+ <td>100.1%
281
+ </td>
282
+ </tr>
283
+ <tr>
284
+ <td>BBH (3-shot)
285
+ </td>
286
+ <td>30.1
287
+ </td>
288
+ <td>31.0
289
+ </td>
290
+ <td>102.9%
291
+ </td>
292
+ </tr>
293
+ <tr>
294
+ <td>Math-lvl-5 (4-shot)
295
+ </td>
296
+ <td>15.7
297
+ </td>
298
+ <td>15.5
299
+ </td>
300
+ <td>98.9%
301
+ </td>
302
+ </tr>
303
+ <tr>
304
+ <td>GPQA (0-shot)
305
+ </td>
306
+ <td>3.7
307
+ </td>
308
+ <td>5.4
309
+ </td>
310
+ <td>146.2%
311
+ </td>
312
+ </tr>
313
+ <tr>
314
+ <td>MuSR (0-shot)
315
+ </td>
316
+ <td>7.6
317
+ </td>
318
+ <td>7.6
319
+ </td>
320
+ <td>100.0%
321
+ </td>
322
+ </tr>
323
+ <tr>
324
+ <td><strong>Average</strong>
325
+ </td>
326
+ <td><strong>27.6</strong>
327
+ </td>
328
+ <td><strong>28.0</strong>
329
+ </td>
330
+ <td><strong>101.5%</strong>
331
+ </td>
332
+ </tr>
333
+ <tr>
334
+ <td><strong>Coding</strong>
335
+ </td>
336
+ </tr>
337
+ <tr>
338
+ <td>HumanEval pass@1
339
+ </td>
340
+ <td>67.3
341
+ </td>
342
+ <td>67.1
343
+ </td>
344
+ <td>99.7%
345
+ </td>
346
+ </tr>
347
+ <tr>
348
+ <td>HumanEval+ pass@1
349
+ </td>
350
+ <td>60.7
351
+ </td>
352
+ <td>60.0
353
+ </td>
354
+ <td>98.8%
355
+ </td>
356
+ </tr>
357
  </table>
358
 
359
  ### Reproduction
 
434
  --tasks truthfulqa \
435
  --num_fewshot 0 \
436
  --batch_size auto
437
+ ```
438
+
439
+ #### OpenLLM v2
440
+ ```
441
+ lm_eval \
442
+ --model vllm \
443
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
444
+ --apply_chat_template \
445
+ --fewshot_as_multiturn \
446
+ --tasks leaderboard \
447
+ --batch_size auto
448
+ ```
449
+
450
+ #### HumanEval and HumanEval+
451
+ ##### Generation
452
+ ```
453
+ python3 codegen/generate.py \
454
+ --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 \
455
+ --bs 16 \
456
+ --temperature 0.2 \
457
+ --n_samples 50 \
458
+ --root "." \
459
+ --dataset humaneval
460
+ ```
461
+ ##### Sanitization
462
+ ```
463
+ python3 evalplus/sanitize.py \
464
+ humaneval/neuralmagic--Meta-Llama-3.1-8B-Instruct-quantized.w8a8_vllm_temp_0.2
465
+ ```
466
+ ##### Evaluation
467
+ ```
468
+ evalplus.evaluate \
469
+ --dataset humaneval \
470
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-8B-Instruct-quantized.w8a8_vllm_temp_0.2-sanitized
471
+ ```