library_name: transformers
tags: []
Evaluation Results
Big-Bench Hard (BBH)
Note: These results use the corrected BBH answer parsing from EleutherAI's lm-evaluation-harness. See this PR.
Overall:
Model | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± | 0.0044 |
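To gauge whether the gap between the two overall BBH scores is larger than sampling noise, one rough check is a z-score built from the reported standard errors. This is a minimal sketch and not part of the evaluation harness: it treats the two runs as independent, which is a simplifying assumption because both models answer the same questions, and the function name `z_score` is hypothetical.

```python
import math

def z_score(acc_a, se_a, acc_b, se_b):
    """z-score for the difference of two accuracies, assuming independent errors."""
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Overall BBH values (exact_match, stderr) from the table above.
z = z_score(0.8241, 0.0042, 0.8036, 0.0044)
print(f"difference = {0.8241 - 0.8036:.4f}, z = {z:.2f}")
```

A paired test on per-example results would be the more appropriate tool if those were available; the table only supports this coarser estimate.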
Breakdown:
Smaug-Qwen2-72B-Instruct:
Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
---|---|---|---|---|---|---|
bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
- bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
- bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
- bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
- bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
- bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
- bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
- bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
- bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
- bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
- bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
- bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
- bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
- bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
- bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
- bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
- bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
- bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
- bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
- bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
- bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
- bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
- bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
- bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
- bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
- bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
- bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
- bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |
Qwen2-72B-Instruct:
Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
---|---|---|---|---|---|---|
bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
- bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
- bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
- bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
- bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
- bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
- bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
- bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
- bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
- bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
- bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
- bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
- bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
- bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
- bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
- bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
- bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
- bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
- bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
- bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
- bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
- bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
- bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
- bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
- bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
- bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
- bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
- bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
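The Stderr column in the breakdown tables is consistent with the usual standard error of a mean over per-example 0/1 scores. The sketch below reproduces two values from the Smaug-Qwen2-72B-Instruct table. The example counts (250 for boolean_expressions, 187 for causal_judgement) are the usual BBH task sizes and are an assumption here, since the tables do not list them; `accuracy_stderr` is a hypothetical helper, not a harness function.

```python
import math

def accuracy_stderr(num_correct, num_examples):
    """Standard error of accuracy: sample std (n - 1 denominator) divided by sqrt(n)."""
    p = num_correct / num_examples
    return math.sqrt(p * (1 - p) / (num_examples - 1))

# (task, assumed number correct, assumed number of examples)
assumed_counts = [
    ("boolean_expressions", 241, 250),
    ("causal_judgement", 123, 187),
]

for task, correct, total in assumed_counts:
    accuracy = round(correct / total, 4)
    stderr = round(accuracy_stderr(correct, total), 4)
    print(task, accuracy, stderr)
```

Smaller tasks carry larger standard errors at the same accuracy, which is worth keeping in mind when comparing per-task differences between the two models.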
LiveCodeBench
Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
---|---|---|---|---|
Smaug-Qwen2-72B-Instruct | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |
Arena-Hard
Scores compared with selected other models (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.
Model | Score | 95% Confidence Interval | Average Tokens |
---|---|---|---|
GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
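The 95% confidence intervals in this table appear to be written as offsets relative to each score rather than as absolute bounds. As a small illustrative sketch under that reading (the helper `absolute_interval` is hypothetical), the offsets can be converted to absolute bounds to compare two rows:

```python
def absolute_interval(score, lower_offset, upper_offset):
    """Convert (lower, upper) offsets around a score into absolute bounds."""
    return (round(score + lower_offset, 1), round(score + upper_offset, 1))

smaug = absolute_interval(48.0, -1.8, 2.1)  # Smaug-Qwen2-72B-Instruct
qwen2 = absolute_interval(43.5, -2.6, 2.7)  # Qwen2-72B-Instruct
print(smaug, qwen2)
```

Under this reading the two intervals meet at 46.2, so the 4.5-point gap between the two models sits right at the edge of the reported uncertainty.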
MT-Bench
First turn
Model | Turn | Score |
---|---|---|
Qwen2-72B-Instruct | 1 | 9.18125 |
Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |
Second turn
Model | Turn | Score |
---|---|---|
Qwen2-72B-Instruct | 2 | 8.74684 |
Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |
Average
Model | Score |
---|---|
Qwen2-72B-Instruct | 8.96541 |
Smaug-Qwen2-72B-Instruct | 8.86563 |
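The Average table does not say how the two turns are combined. For Smaug-Qwen2-72B-Instruct the plain mean of the two turn scores matches the reported value, but for Qwen2-72B-Instruct it does not (8.96405 versus the reported 8.96541). One reading that reproduces both numbers is a mean pooled over individual judgments, with 80 first-turn and 79 second-turn judgments for Qwen2-72B-Instruct. Those counts are an assumption inferred from the arithmetic, not something stated in the source, and `pooled_mean` is a hypothetical helper.

```python
def pooled_mean(turns):
    """Mean over all judgments, given (turn_mean, judgment_count) pairs."""
    total = sum(mean * count for mean, count in turns)
    return total / sum(count for _, count in turns)

# Judgment counts below are assumed, not taken from the source.
qwen2 = pooled_mean([(9.18125, 80), (8.74684, 79)])
smaug = pooled_mean([(9.05625, 80), (8.67500, 80)])
print(f"{qwen2:.5f} {smaug:.5f}")
```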