Smaug-Qwen2-72B-Instruct

Introduction

We introduce the latest model in the Smaug series, a finetune of Qwen2-72B-Instruct.

Compared to Qwen2-72B-Instruct, Smaug has better BBH, LiveCodeBench, and Arena-Hard scores (see evaluation results below).

How to use

The prompt format is unchanged from Qwen2-72B-Instruct.
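For reference, Qwen2-72B-Instruct uses the ChatML layout, in which each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. The sketch below spells that layout out by hand. The helper name `build_chatml_prompt` is hypothetical and exists only for illustration. It does not reproduce every detail of the tokenizer's own template (for example, any default system prompt the template may insert), so `tokenizer.apply_chat_template` remains the authoritative way to build prompts.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Hypothetical helper for illustration only; prefer
    tokenizer.apply_chat_template in real code.
    """
    parts = []
    for message in messages:
        # Each turn: <|im_start|>ROLE\nCONTENT<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
print(build_chatml_prompt(example_messages))
```

Because every turn ends with `<|im_end|>`, that token is the natural stop token for generation.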

Use with transformers

See the snippet below for usage with Transformers:

import transformers
import torch

model_id = "abacusai/Smaug-Qwen2-72B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Qwen2's ChatML prompt format ends each turn with <|im_end|>.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|im_end|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

Evaluation Results

Big-Bench Hard (BBH)

Note: These results use corrected answer parsing for BBH in EleutherAI's lm-evaluation-harness. See this PR.

Overall:

| Model | Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± 0.0042 |
| Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± 0.0044 |
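As a rough way to read the overall numbers, the gap between the two models can be compared with the reported standard errors. The sketch below combines the two standard errors as if the runs were independent. That is an assumption: both models were scored on the same questions, so a paired test would be more appropriate, and the result should be treated as indicative only. The function name is hypothetical.

```python
import math


def difference_z_score(value_a, stderr_a, value_b, stderr_b):
    """Return (difference, combined stderr, z-score) for two reported scores.

    Assumes the two estimates are independent, which is only an
    approximation when both models are scored on the same questions.
    """
    diff = value_a - value_b
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return diff, combined, diff / combined


# Overall BBH exact_match values and standard errors from the table above.
diff, combined, z = difference_z_score(0.8241, 0.0042, 0.8036, 0.0044)
print(f"difference={diff:.4f} combined_stderr={combined:.4f} z={z:.2f}")
```

Under that independence assumption, the overall gap of about 0.02 is a little over three combined standard errors.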

Breakdown:

Smaug-Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |

Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
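The Stderr column in the breakdowns is consistent with the standard error of a mean over 0/1 exact-match outcomes, computed with a sample (n - 1) denominator. The sketch below reproduces a few of the reported values. The per-task example counts (250 for boolean_expressions, 146 for penguins_in_a_table) are assumptions based on the usual BBH task sizes and are not stated in the tables. The formula is inferred from the numbers rather than taken from the harness source.

```python
import math


def binomial_stderr(accuracy, n_examples):
    """Standard error of a mean of 0/1 outcomes, using the sample (n - 1) variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_examples - 1))


# (task, accuracy from the tables, assumed number of examples)
checks = [
    ("Smaug boolean_expressions", 0.9640, 250),
    ("Smaug penguins_in_a_table", 0.8493, 146),
    ("Qwen2 penguins_in_a_table", 0.5753, 146),
]
for name, accuracy, n_examples in checks:
    print(f"{name}: stderr = {binomial_stderr(accuracy, n_examples):.4f}")
```

Tasks with fewer examples, such as penguins_in_a_table, carry wider standard errors, so large per-task swings there are less certain than the same swing on a 250-example task.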

LiveCodeBench

| Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
| Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |

Arena-Hard

Scores compared against selected other models (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

| Model | Score | 95% Confidence Interval | Average Tokens |
|---|---|---|---|
| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
| Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
| Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
| Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
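The confidence intervals in the Arena-Hard table are given as offsets from the score, so it can help to convert them to absolute bounds before comparing two models. The sketch below does that for Smaug-Qwen2-72B-Instruct and Qwen2-72B-Instruct. The function name is hypothetical, and the bounds are rounded to one decimal to match the table's precision.

```python
def interval_bounds(score, lower_offset, upper_offset):
    """Convert a score and its (lower, upper) offsets into absolute bounds."""
    return round(score + lower_offset, 1), round(score + upper_offset, 1)


# Scores and offsets copied from the Arena-Hard table.
smaug_bounds = interval_bounds(48.0, -1.8, 2.1)
qwen2_bounds = interval_bounds(43.5, -2.6, 2.7)
print("Smaug-Qwen2-72B-Instruct:", smaug_bounds)
print("Qwen2-72B-Instruct:", qwen2_bounds)
```

At the table's precision, the lower bound for Smaug-Qwen2-72B-Instruct coincides with the upper bound for Qwen2-72B-Instruct, so the two intervals meet but do not meaningfully overlap.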

MT-Bench

First turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 1 | 9.18125 |
| Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |

Second turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 2 | 8.74684 |
| Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |

Average

| Model | Score |
|---|---|
| Qwen2-72B-Instruct | 8.96541 |
| Smaug-Qwen2-72B-Instruct | 8.86563 |
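The MT-Bench averages appear to be pooled over all judged answers rather than taken as the plain mean of the two turn scores. The sketch below reproduces the reported averages under an assumption that is not stated in the source: 80 judged answers per turn (MT-Bench has 80 questions), except for 79 second-turn judgments for Qwen2-72B-Instruct. The count of 79 is a guess, chosen because it is consistent with the reported 8.74684 and 8.96541.

```python
def pooled_average(turn_scores):
    """Average over all judged answers, given (mean score, answer count) per turn."""
    total = sum(score * count for score, count in turn_scores)
    n_answers = sum(count for _, count in turn_scores)
    return total / n_answers


# (turn mean from the tables, ASSUMED number of judged answers)
smaug_avg = pooled_average([(9.05625, 80), (8.67500, 80)])
qwen2_avg = pooled_average([(9.18125, 80), (8.74684, 79)])
print(f"Smaug-Qwen2-72B-Instruct: {smaug_avg:.5f}")
print(f"Qwen2-72B-Instruct: {qwen2_avg:.5f}")
```

With equal counts per turn the pooled average equals the simple mean of the two turns, which is why only the Qwen2-72B-Instruct figure differs from a plain mean.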

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 41.08 |
| IFEval (0-Shot) | 78.25 |
| BBH (3-Shot) | 56.27 |
| MATH Lvl 5 (4-Shot) | 35.35 |
| GPQA (0-shot) | 14.88 |
| MuSR (0-shot) | 15.18 |
| MMLU-PRO (5-shot) | 46.56 |
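The Avg. row matches the unweighted mean of the six benchmark scores, which the short check below illustrates. The dictionary simply restates the values from the table, and the variable names are arbitrary.

```python
# Scores copied from the Open LLM Leaderboard table above.
leaderboard_scores = {
    "IFEval (0-Shot)": 78.25,
    "BBH (3-Shot)": 56.27,
    "MATH Lvl 5 (4-Shot)": 35.35,
    "GPQA (0-shot)": 14.88,
    "MuSR (0-shot)": 15.18,
    "MMLU-PRO (5-shot)": 46.56,
}

# Unweighted mean across the six benchmarks.
leaderboard_average = sum(leaderboard_scores.values()) / len(leaderboard_scores)
print(f"Average: {leaderboard_average:.2f}")
```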