library_name: transformers
tags: []

Evaluation Results

Big-Bench Hard (BBH)

Note: These results were obtained with corrected answer parsing for the BBH task in EleutherAI's lm-evaluation-harness. See this PR.

Overall:

| Model | Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± 0.0042 |
| Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± 0.0044 |
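The gap between the two overall scores can be set against the reported standard errors. The following is a minimal sketch, assuming the two estimates are independent and that a normal approximation is adequate (neither is stated in this card); the function name `z_score_of_difference` is illustrative only.

```python
import math


def z_score_of_difference(value_a, stderr_a, value_b, stderr_b):
    """Return (a - b) divided by the standard error of the difference.

    Assumes the two estimates are independent, so their variances add.
    """
    combined_stderr = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return (value_a - value_b) / combined_stderr


# Overall BBH exact_match values and standard errors from the table above.
z = z_score_of_difference(0.8241, 0.0042, 0.8036, 0.0044)
print(f"difference = {0.8241 - 0.8036:.4f}, z = {z:.2f}")
```

Because both models are scored on the same questions, a paired test on per-question results would be the more appropriate comparison; the calculation above only uses the aggregate numbers available here.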

Breakdown:

Smaug-Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |
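The per-task standard errors above are consistent with the usual standard error of a mean of 0/1 scores. The sketch below checks three Smaug-Qwen2-72B-Instruct rows. The example counts per task (250, 187, 146) are an assumption based on the commonly cited BBH task sizes and do not appear in this card; `sample_stderr` is an illustrative helper, not a function from lm-evaluation-harness.

```python
import math


def sample_stderr(accuracy, n_examples):
    """Standard error of a mean of 0/1 scores, using the n - 1 sample variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_examples - 1))


# (task, reported accuracy, reported stderr, ASSUMED number of examples)
rows = [
    ("boolean_expressions", 0.9640, 0.0118, 250),
    ("causal_judgement", 0.6578, 0.0348, 187),
    ("penguins_in_a_table", 0.8493, 0.0297, 146),
]

for task, accuracy, reported, n_examples in rows:
    computed = sample_stderr(accuracy, n_examples)
    print(f"{task}: computed {computed:.4f}, reported {reported:.4f}")
```

Tasks with fewer examples or accuracies near 0.5 carry the widest standard errors, which is worth keeping in mind when reading single-task differences between the two models.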

Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |

LiveCodeBench

| Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
| Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |

Arena-Hard

Scores compared with selected other models (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

| Model | Score | 95% Confidence Interval | Average Tokens |
|---|---|---|---|
| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
| Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
| Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
| Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
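The confidence intervals above are given as offsets from the score. As a small sketch, the helper below (the name `interval_bounds` is illustrative) converts a score and its offsets into absolute bounds, rounded to one decimal place to match the precision of the table.

```python
def interval_bounds(score, lower_offset, upper_offset):
    """Turn a score and its (negative, positive) offsets into absolute bounds."""
    return round(score + lower_offset, 1), round(score + upper_offset, 1)


# Values copied from the Arena-Hard table above.
smaug = interval_bounds(48.0, -1.8, 2.1)
qwen2 = interval_bounds(43.5, -2.6, 2.7)
print("Smaug-Qwen2-72B-Instruct:", smaug)
print("Qwen2-72B-Instruct:", qwen2)
```

At the table's precision, the lower bound for Smaug-Qwen2-72B-Instruct and the upper bound for Qwen2-72B-Instruct coincide, so the two intervals are adjacent rather than clearly separated.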

MT-Bench

First turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 1 | 9.18125 |
| Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |

Second turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 2 | 8.74684 |
| Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |

Average

| Model | Score |
|---|---|
| Qwen2-72B-Instruct | 8.96541 |
| Smaug-Qwen2-72B-Instruct | 8.86563 |
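The reported average for Qwen2-72B-Instruct (8.96541) differs slightly from the plain mean of its two turn scores (about 8.96405), which suggests the average is pooled over individual judged answers rather than over turns. The sketch below reproduces both averages under an assumption that is not stated in this card: 80 judged first-turn answers for each model, 79 judged second-turn answers for Qwen2-72B-Instruct, and 80 for Smaug-Qwen2-72B-Instruct. The function name `pooled_average` is illustrative.

```python
def pooled_average(turn1_mean, n_turn1, turn2_mean, n_turn2):
    """Average over all judged answers, weighting each turn by its count."""
    total = turn1_mean * n_turn1 + turn2_mean * n_turn2
    return total / (n_turn1 + n_turn2)


# The counts (80, 79, 80) are ASSUMED; they are not taken from the model card.
qwen2_avg = pooled_average(9.18125, 80, 8.74684, 79)
smaug_avg = pooled_average(9.05625, 80, 8.67500, 80)
print(f"Qwen2-72B-Instruct: {qwen2_avg:.5f}")
print(f"Smaug-Qwen2-72B-Instruct: {smaug_avg:.5f}")
```

If the assumed counts are wrong, the same helper can be rerun with the actual number of judged answers per turn.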

Model Card for Model ID

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

  • Developed by: [More Information Needed]
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: [More Information Needed]
  • Model type: [More Information Needed]
  • Language(s) (NLP): [More Information Needed]
  • License: [More Information Needed]
  • Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

Direct Use

[More Information Needed]

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]