language:
  - en
license: other
tags:
  - chat
license_name: tongyi-qianwen
license_link: https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE
pipeline_tag: text-generation
model-index:
  - name: Smaug-Qwen2-72B-Instruct
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 78.25
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 56.27
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 35.35
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 14.88
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 15.18
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 46.56
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
          name: Open LLM Leaderboard

Smaug-Qwen2-72B-Instruct


Introduction

We introduce the latest model in the Smaug series: a finetune of Qwen2-72B-Instruct.

Compared to Qwen2-72B-Instruct, Smaug has better BBH, LiveCodeBench, and Arena-Hard scores (see evaluation results below).

How to use

The prompt format is unchanged from Qwen2-72B-Instruct.
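For reference, Qwen2-72B-Instruct uses a ChatML-style layout, so the prompt can be pictured with the minimal sketch below. The helper name `build_chatml_prompt` is hypothetical and only illustrates the layout. The chat template shipped with the tokenizer remains the source of truth, and it may differ in details such as adding a default system message when none is supplied, which this sketch does not do.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in ChatML layout.

    Assumption: this mirrors the ChatML format used by Qwen2 instruct models;
    prefer tokenizer.apply_chat_template for real use.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

In practice, `apply_chat_template` with `add_generation_prompt=True` is the safer way to produce this string, since it uses the template stored with the model.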

Use with transformers

See the snippet below for usage with Transformers:

import transformers
import torch

model_id = "abacusai/Smaug-Qwen2-72B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop on the tokenizer's EOS token and on the ChatML end-of-turn token used by Qwen2.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|im_end|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

Evaluation Results

Big-Bench Hard (BBH)

Note: These results are with corrected parsing for BBH from Eleuther's lm-evaluation-harness. See this PR.

Overall:

| Model | Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± 0.0042 |
| Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± 0.0044 |
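As a rough way to read the overall gap against the reported standard errors, the sketch below computes the difference and a z-score. It assumes the two estimates are independent, which is a simplification: both models were scored on the same BBH items, so a paired comparison would be more appropriate if per-item results were available.

```python
import math

# Overall BBH exact_match and standard errors, copied from the table above.
smaug_acc, smaug_se = 0.8241, 0.0042
qwen_acc, qwen_se = 0.8036, 0.0044

diff = smaug_acc - qwen_acc

# Assumption: independent estimates, so the variances add.
se_diff = math.sqrt(smaug_se**2 + qwen_se**2)
z = diff / se_diff

print(f"difference = {diff:.4f}, combined stderr = {se_diff:.4f}, z = {z:.2f}")
```

Under that assumption the gap of about two accuracy points is a little over three combined standard errors.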

Breakdown:

Smaug-Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |

Qwen2-72B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
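To see where the two breakdowns diverge most, the sketch below ranks per-task differences. Only a hand-picked subset of tasks is included, with values copied from the two breakdown tables; the dictionary layout and variable names are illustrative, not part of any evaluation tooling.

```python
# (Smaug-Qwen2-72B-Instruct, Qwen2-72B-Instruct) exact_match for a subset of BBH tasks.
scores = {
    "penguins_in_a_table": (0.8493, 0.5753),
    "logical_deduction_five_objects": (0.7680, 0.6800),
    "logical_deduction_seven_objects": (0.5360, 0.4720),
    "tracking_shuffled_objects_five_objects": (0.9800, 0.9160),
    "object_counting": (0.9200, 0.9480),
    "formal_fallacies": (0.7120, 0.7480),
    "reasoning_about_colored_objects": (0.7560, 0.8120),
}

# Positive delta means Smaug scored higher than the base model on that task.
deltas = sorted(
    ((task, smaug - qwen) for task, (smaug, qwen) in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)

for task, delta in deltas:
    print(f"{task:45s} {delta:+.4f}")
```

Within this subset the largest gain is on `penguins_in_a_table`, while a few tasks such as `reasoning_about_colored_objects` move slightly the other way.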

LiveCodeBench

| Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|---|---|---|---|---|
| Smaug-Qwen2-72B-Instruct | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
| Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |

Arena-Hard

Scores compared with selected other models, sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge. GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

| Model | Score | 95% Confidence Interval | Average Tokens |
|---|---|---|---|
| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
| Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
| Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
| Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |

MT-Bench

First turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 1 | 9.18125 |
| Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |

Second turn

| Model | Turn | Score |
|---|---|---|
| Qwen2-72B-Instruct | 2 | 8.74684 |
| Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |

Average

| Model | Score |
|---|---|
| Qwen2-72B-Instruct | 8.96541 |
| Smaug-Qwen2-72B-Instruct | 8.86563 |
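The Qwen2-72B-Instruct average is not exactly the midpoint of its two turn scores, which suggests the average is pooled over all judged turns rather than taken over the two per-turn means. The sketch below checks one reading that reproduces the reported numbers. The judgment counts are an assumption (80 first-turn judgments for both models, 79 second-turn judgments for Qwen2-72B-Instruct and 80 for Smaug) and are not stated in the source; `pooled_average` is a hypothetical helper.

```python
def pooled_average(turns):
    """Average over all judgments, given (mean_score, n_judgments) per turn."""
    total = sum(mean * n for mean, n in turns)
    count = sum(n for _, n in turns)
    return total / count


# Assumed judgment counts; only the per-turn means come from the tables above.
qwen_avg = pooled_average([(9.18125, 80), (8.74684, 79)])
smaug_avg = pooled_average([(9.05625, 80), (8.67500, 80)])

print(f"Qwen2-72B-Instruct:       {qwen_avg:.5f}")
print(f"Smaug-Qwen2-72B-Instruct: {smaug_avg:.5f}")
```

If the counts differ from this assumption, the averages would need to be recomputed from the raw judgment files.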

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 41.08 |
| IFEval (0-Shot) | 78.25 |
| BBH (3-Shot) | 56.27 |
| MATH Lvl 5 (4-Shot) | 35.35 |
| GPQA (0-shot) | 14.88 |
| MuSR (0-shot) | 15.18 |
| MMLU-PRO (5-shot) | 46.56 |
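The reported average is consistent with an unweighted mean of the six benchmark values, as the short check below shows. The variable names are illustrative.

```python
# Benchmark values copied from the table above.
leaderboard = {
    "IFEval (0-Shot)": 78.25,
    "BBH (3-Shot)": 56.27,
    "MATH Lvl 5 (4-Shot)": 35.35,
    "GPQA (0-shot)": 14.88,
    "MuSR (0-shot)": 15.18,
    "MMLU-PRO (5-shot)": 46.56,
}

# Unweighted mean across the six benchmarks.
average = sum(leaderboard.values()) / len(leaderboard)
print(f"Avg. = {average:.2f}")
```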