metadata

library_name: transformers
tags: []

Evaluation Results

Big-Bench Hard (BBH)

Note: These results are with corrected parsing for BBH from Eleuther's lm-evaluation-harness. See this PR.

Overall:

Model	Groups	Version	Filter	n-shot	Metric	Value		Stderr
Smaug-Qwen2-72B-Instruct	bbh	N/A	get-answer	3	exact_match	0.8241	±	0.0042
Qwen2-72B-Instruct	bbh	N/A	get-answer	3	exact_match	0.8036	±	0.0044

Breakdown:

Smaug-Qwen2-72B-Instruct:

Tasks	Version	Filter	n-shot	Metric	Value	Stderr
bbh	N/A	get-answer	3	exact_match	0.8241	0.0042
- bbh_cot_fewshot_boolean_expressions	2	get-answer	3	exact_match	0.9640	0.0118
- bbh_cot_fewshot_causal_judgement	2	get-answer	3	exact_match	0.6578	0.0348
- bbh_cot_fewshot_date_understanding	2	get-answer	3	exact_match	0.8360	0.0235
- bbh_cot_fewshot_disambiguation_qa	2	get-answer	3	exact_match	0.8280	0.0239
- bbh_cot_fewshot_dyck_languages	2	get-answer	3	exact_match	0.3360	0.0299
- bbh_cot_fewshot_formal_fallacies	2	get-answer	3	exact_match	0.7120	0.0287
- bbh_cot_fewshot_geometric_shapes	2	get-answer	3	exact_match	0.5320	0.0316
- bbh_cot_fewshot_hyperbaton	2	get-answer	3	exact_match	0.9880	0.0069
- bbh_cot_fewshot_logical_deduction_five_objects	2	get-answer	3	exact_match	0.7680	0.0268
- bbh_cot_fewshot_logical_deduction_seven_objects	2	get-answer	3	exact_match	0.5360	0.0316
- bbh_cot_fewshot_logical_deduction_three_objects	2	get-answer	3	exact_match	0.9720	0.0105
- bbh_cot_fewshot_movie_recommendation	2	get-answer	3	exact_match	0.8000	0.0253
- bbh_cot_fewshot_multistep_arithmetic_two	2	get-answer	3	exact_match	0.9720	0.0105
- bbh_cot_fewshot_navigate	2	get-answer	3	exact_match	0.9640	0.0118
- bbh_cot_fewshot_object_counting	2	get-answer	3	exact_match	0.9200	0.0172
- bbh_cot_fewshot_penguins_in_a_table	2	get-answer	3	exact_match	0.8493	0.0297
- bbh_cot_fewshot_reasoning_about_colored_objects	2	get-answer	3	exact_match	0.7560	0.0272
- bbh_cot_fewshot_ruin_names	2	get-answer	3	exact_match	0.8520	0.0225
- bbh_cot_fewshot_salient_translation_error_detection	2	get-answer	3	exact_match	0.5920	0.0311
- bbh_cot_fewshot_snarks	2	get-answer	3	exact_match	0.9101	0.0215
- bbh_cot_fewshot_sports_understanding	2	get-answer	3	exact_match	0.9440	0.0146
- bbh_cot_fewshot_temporal_sequences	2	get-answer	3	exact_match	1.0000	0.0000
- bbh_cot_fewshot_tracking_shuffled_objects_five_objects	2	get-answer	3	exact_match	0.9800	0.0089
- bbh_cot_fewshot_tracking_shuffled_objects_seven_objects	2	get-answer	3	exact_match	0.9560	0.0130
- bbh_cot_fewshot_tracking_shuffled_objects_three_objects	2	get-answer	3	exact_match	0.9640	0.0118
- bbh_cot_fewshot_web_of_lies	2	get-answer	3	exact_match	1.0000	0.0000
- bbh_cot_fewshot_word_sorting	2	get-answer	3	exact_match	0.6560	0.0301

Qwen2-72B-Instruct:

Tasks	Version	Filter	n-shot	Metric	Value	Stderr
bbh	N/A	get-answer	3	exact_match	0.8036	0.0044
- bbh_cot_fewshot_boolean_expressions	2	get-answer	3	exact_match	0.9640	0.0118
- bbh_cot_fewshot_causal_judgement	2	get-answer	3	exact_match	0.6684	0.0345
- bbh_cot_fewshot_date_understanding	2	get-answer	3	exact_match	0.8000	0.0253
- bbh_cot_fewshot_disambiguation_qa	2	get-answer	3	exact_match	0.8360	0.0235
- bbh_cot_fewshot_dyck_languages	2	get-answer	3	exact_match	0.3040	0.0292
- bbh_cot_fewshot_formal_fallacies	2	get-answer	3	exact_match	0.7480	0.0275
- bbh_cot_fewshot_geometric_shapes	2	get-answer	3	exact_match	0.4960	0.0317
- bbh_cot_fewshot_hyperbaton	2	get-answer	3	exact_match	0.9440	0.0146
- bbh_cot_fewshot_logical_deduction_five_objects	2	get-answer	3	exact_match	0.6800	0.0296
- bbh_cot_fewshot_logical_deduction_seven_objects	2	get-answer	3	exact_match	0.4720	0.0316
- bbh_cot_fewshot_logical_deduction_three_objects	2	get-answer	3	exact_match	0.9200	0.0172
- bbh_cot_fewshot_movie_recommendation	2	get-answer	3	exact_match	0.7800	0.0263
- bbh_cot_fewshot_multistep_arithmetic_two	2	get-answer	3	exact_match	0.9760	0.0097
- bbh_cot_fewshot_navigate	2	get-answer	3	exact_match	0.9520	0.0135
- bbh_cot_fewshot_object_counting	2	get-answer	3	exact_match	0.9480	0.0141
- bbh_cot_fewshot_penguins_in_a_table	2	get-answer	3	exact_match	0.5753	0.0410
- bbh_cot_fewshot_reasoning_about_colored_objects	2	get-answer	3	exact_match	0.8120	0.0248
- bbh_cot_fewshot_ruin_names	2	get-answer	3	exact_match	0.8760	0.0209
- bbh_cot_fewshot_salient_translation_error_detection	2	get-answer	3	exact_match	0.5880	0.0312
- bbh_cot_fewshot_snarks	2	get-answer	3	exact_match	0.8764	0.0247
- bbh_cot_fewshot_sports_understanding	2	get-answer	3	exact_match	0.9080	0.0183
- bbh_cot_fewshot_temporal_sequences	2	get-answer	3	exact_match	0.9960	0.0040
- bbh_cot_fewshot_tracking_shuffled_objects_five_objects	2	get-answer	3	exact_match	0.9160	0.0176
- bbh_cot_fewshot_tracking_shuffled_objects_seven_objects	2	get-answer	3	exact_match	0.9400	0.0151
- bbh_cot_fewshot_tracking_shuffled_objects_three_objects	2	get-answer	3	exact_match	0.9440	0.0146
- bbh_cot_fewshot_web_of_lies	2	get-answer	3	exact_match	1.0000	0.0000
- bbh_cot_fewshot_word_sorting	2	get-answer	3	exact_match	0.6680	0.0298

LiveCodeBench

Model	Pass@1	Easy Pass@1	Medium Pass@1	Hard Pass@1
Smaug-Qwen2-72B-Instruct	0.3357	0.7286	0.1633	0.0000
Qwen2-72B-Instruct	0.3139	0.6810	0.1531	0.0000

Arena-Hard

Score vs selected others (sourced from: (https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge)). GPT-4o and Gemini-1.5-pro-latest were missing from the original blob post, and we produced those numbers from a local run using the same methodology.

Model	Score	95% Confidence Interval	Average Tokens
GPT-4-Turbo-2024-04-09	82.6	(-1.8, 1.6)	662
GPT-4o	78.3	(-2.4, 2.1)	685
Gemini-1.5-pro-latest	72.1	(-2.3, 2.2)	630
Claude-3-Opus-20240229	60.4	(-3.3, 2.4)	541
Smaug-Llama-3-70B-Instruct	56.7	(-2.2, 2.6)	661
GPT-4-0314	50.0	(-0.0, 0.0)	423
Smaug-Qwen2-72B-Instruct	48.0	(-1.8, 2.1)	628
Claude-3-Sonnet-20240229	46.8	(-2.1, 2.2)	552
Qwen2-72B-Instruct	43.5	(-2.6, 2.7)	531
Llama-3-70B-Instruct	41.1	(-2.5, 2.4)	583
GPT-4-0613	37.9	(-2.2, 2.0)	354
Mistral-Large-2402	37.7	(-1.9, 2.6)	400
Mixtral-8x22B-Instruct-v0.1	36.4	(-2.7, 2.9)	430
Qwen1.5-72B-Chat	36.1	(-2.5, 2.2)	474
Command-R-Plus	33.1	(-2.1, 2.2)	541
Mistral-Medium	31.9	(-2.3, 2.4)	485
GPT-3.5-Turbo-0613	24.8	(-1.6, 2.0)	401

MT-Bench

First turn

Model	Turn	Score
Qwen2-72B-Instruct	1	9.18125
Smaug-Qwen2-72B-Instruct	1	9.05625

Second turn

Model	Turn	Score
Qwen2-72B-Instruct	2	8.74684
Smaug-Qwen2-72B-Instruct	2	8.67500

Average

Model	Score
Qwen2-72B-Instruct	8.96541
Smaug-Qwen2-72B-Instruct	8.86563

Model Card for Model ID

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

Developed by: [More Information Needed]
Funded by [optional]: [More Information Needed]
Shared by [optional]: [More Information Needed]
Model type: [More Information Needed]
Language(s) (NLP): [More Information Needed]
License: [More Information Needed]
Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

Repository: [More Information Needed]
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]

Uses

Direct Use

[More Information Needed]

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]