	---
	language:
	- en
	license: other
	tags:
	- chat
	license_name: tongyi-qianwen
	license_link: https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE
	pipeline_tag: text-generation
	model-index:
	- name: Smaug-Qwen2-72B-Instruct
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: IFEval (0-Shot)
	      type: HuggingFaceH4/ifeval
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: inst_level_strict_acc and prompt_level_strict_acc
	      value: 78.25
	      name: strict accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: BBH (3-Shot)
	      type: BBH
	      args:
	        num_few_shot: 3
	    metrics:
	    - type: acc_norm
	      value: 56.27
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MATH Lvl 5 (4-Shot)
	      type: hendrycks/competition_math
	      args:
	        num_few_shot: 4
	    metrics:
	    - type: exact_match
	      value: 35.35
	      name: exact match
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GPQA (0-shot)
	      type: Idavidrein/gpqa
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 14.88
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MuSR (0-shot)
	      type: TAUR-Lab/MuSR
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 15.18
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU-PRO (5-shot)
	      type: TIGER-Lab/MMLU-Pro
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 46.56
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Qwen2-72B-Instruct
	      name: Open LLM Leaderboard
	---

	# Smaug-Qwen2-72B-Instruct

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/NtH_6eS-yyuEgbKeiek1_.png)

	# Introduction

	We introduce the latest model in the Smaug series, a finetune of [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct).

	Compared to Qwen2-72B-Instruct, Smaug has better BBH, LiveCodeBench, and Arena-Hard scores (see evaluation results below).

	## How to use

	The prompt format is unchanged from Qwen2-72B-Instruct.
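
As a rough illustration of what that format looks like, the sketch below renders a conversation in the ChatML layout, which is assumed here to be what the Qwen2-72B-Instruct chat template produces. The helper `build_chatml_prompt` is hypothetical and exists only for illustration. The real template may also insert a default system message, so treat the tokenizer's own `apply_chat_template` as the authoritative source.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render messages in the ChatML layout (assumed to match Qwen2's template)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
print(build_chatml_prompt(example_messages))
```

In practice there is no need to build this string by hand; the sketch only shows roughly what the rendered prompt is expected to contain.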

	### Use with transformers

	See the snippet below for usage with Transformers:

	```python
	import torch
	import transformers

	model_id = "abacusai/Smaug-Qwen2-72B-Instruct"

	# Load the model in bfloat16 and spread it across the available devices.
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model_id,
	    model_kwargs={"torch_dtype": torch.bfloat16},
	    device_map="auto",
	)

	messages = [
	    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
	    {"role": "user", "content": "Who are you?"},
	]

	# Render the conversation with the tokenizer's chat template (unchanged from Qwen2).
	prompt = pipeline.tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Stop on the EOS token or on Qwen2's end-of-turn token.
	terminators = [
	    pipeline.tokenizer.eos_token_id,
	    pipeline.tokenizer.convert_tokens_to_ids("<|im_end|>"),
	]

	outputs = pipeline(
	    prompt,
	    max_new_tokens=256,
	    eos_token_id=terminators,
	    do_sample=True,
	    temperature=0.6,
	    top_p=0.9,
	)
	# Print only the newly generated text, without the prompt.
	print(outputs[0]["generated_text"][len(prompt):])
	```

	# Evaluation Results

	## Big-Bench Hard (BBH)

	Note: These results use the corrected BBH answer parsing for EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). See [this PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/2013).

	#### Overall:

	| Model | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
	|----------------------------|--------|---------|------------|--------|-------------|--------|---|--------|
	| Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
	| Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± | 0.0044 |
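
To get a feel for how large the overall gap is relative to the reported standard errors, here is a minimal sketch that computes a z-statistic for the difference between the two scores. It assumes the two errors are independent, which is a simplification because both models were scored on the same test items, so treat the result as a rough indication rather than a formal test. The function name `z_score` is illustrative only.

```python
import math

# Overall BBH exact_match values and standard errors from the table above.
smaug = {"value": 0.8241, "stderr": 0.0042}
qwen2 = {"value": 0.8036, "stderr": 0.0044}


def z_score(a, b):
    """z-statistic for the difference of two scores, assuming independent errors."""
    diff = a["value"] - b["value"]
    combined_stderr = math.sqrt(a["stderr"] ** 2 + b["stderr"] ** 2)
    return diff / combined_stderr


z = z_score(smaug, qwen2)
print(f"difference = {smaug['value'] - qwen2['value']:.4f}, z = {z:.2f}")
```

Under these assumptions the difference of about 0.0205 is a bit more than three combined standard errors.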

	#### Breakdown:

	Smaug-Qwen2-72B-Instruct:

	| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
	|-----------------------------------------------------------|---------|------------|--------|-------------|--------|--------|
	| bbh | N/A | get-answer | 3 | exact_match | 0.8241 | 0.0042 |
	| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
	| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6578 | 0.0348 |
	| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
	| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8280 | 0.0239 |
	| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3360 | 0.0299 |
	| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7120 | 0.0287 |
	| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.5320 | 0.0316 |
	| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9880 | 0.0069 |
	| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.7680 | 0.0268 |
	| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.5360 | 0.0316 |
	| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
	| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
	| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9720 | 0.0105 |
	| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
	| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
	| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.8493 | 0.0297 |
	| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.7560 | 0.0272 |
	| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8520 | 0.0225 |
	| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5920 | 0.0311 |
	| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.9101 | 0.0215 |
	| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
	| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
	| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9800 | 0.0089 |
	| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9560 | 0.0130 |
	| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
	| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
	| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6560 | 0.0301 |

	Qwen2-72B-Instruct:

	| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
	|-----------------------------------------------------------|---------|------------|--------|-------------|--------|--------|
	| bbh | N/A | get-answer | 3 | exact_match | 0.8036 | 0.0044 |
	| - bbh_cot_fewshot_boolean_expressions | 2 | get-answer | 3 | exact_match | 0.9640 | 0.0118 |
	| - bbh_cot_fewshot_causal_judgement | 2 | get-answer | 3 | exact_match | 0.6684 | 0.0345 |
	| - bbh_cot_fewshot_date_understanding | 2 | get-answer | 3 | exact_match | 0.8000 | 0.0253 |
	| - bbh_cot_fewshot_disambiguation_qa | 2 | get-answer | 3 | exact_match | 0.8360 | 0.0235 |
	| - bbh_cot_fewshot_dyck_languages | 2 | get-answer | 3 | exact_match | 0.3040 | 0.0292 |
	| - bbh_cot_fewshot_formal_fallacies | 2 | get-answer | 3 | exact_match | 0.7480 | 0.0275 |
	| - bbh_cot_fewshot_geometric_shapes | 2 | get-answer | 3 | exact_match | 0.4960 | 0.0317 |
	| - bbh_cot_fewshot_hyperbaton | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
	| - bbh_cot_fewshot_logical_deduction_five_objects | 2 | get-answer | 3 | exact_match | 0.6800 | 0.0296 |
	| - bbh_cot_fewshot_logical_deduction_seven_objects | 2 | get-answer | 3 | exact_match | 0.4720 | 0.0316 |
	| - bbh_cot_fewshot_logical_deduction_three_objects | 2 | get-answer | 3 | exact_match | 0.9200 | 0.0172 |
	| - bbh_cot_fewshot_movie_recommendation | 2 | get-answer | 3 | exact_match | 0.7800 | 0.0263 |
	| - bbh_cot_fewshot_multistep_arithmetic_two | 2 | get-answer | 3 | exact_match | 0.9760 | 0.0097 |
	| - bbh_cot_fewshot_navigate | 2 | get-answer | 3 | exact_match | 0.9520 | 0.0135 |
	| - bbh_cot_fewshot_object_counting | 2 | get-answer | 3 | exact_match | 0.9480 | 0.0141 |
	| - bbh_cot_fewshot_penguins_in_a_table | 2 | get-answer | 3 | exact_match | 0.5753 | 0.0410 |
	| - bbh_cot_fewshot_reasoning_about_colored_objects | 2 | get-answer | 3 | exact_match | 0.8120 | 0.0248 |
	| - bbh_cot_fewshot_ruin_names | 2 | get-answer | 3 | exact_match | 0.8760 | 0.0209 |
	| - bbh_cot_fewshot_salient_translation_error_detection | 2 | get-answer | 3 | exact_match | 0.5880 | 0.0312 |
	| - bbh_cot_fewshot_snarks | 2 | get-answer | 3 | exact_match | 0.8764 | 0.0247 |
	| - bbh_cot_fewshot_sports_understanding | 2 | get-answer | 3 | exact_match | 0.9080 | 0.0183 |
	| - bbh_cot_fewshot_temporal_sequences | 2 | get-answer | 3 | exact_match | 0.9960 | 0.0040 |
	| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2 | get-answer | 3 | exact_match | 0.9160 | 0.0176 |
	| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | 2 | get-answer | 3 | exact_match | 0.9400 | 0.0151 |
	| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects | 2 | get-answer | 3 | exact_match | 0.9440 | 0.0146 |
	| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
	| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |

	## LiveCodeBench

	| Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
	|--------------------------|--------|-------------|---------------|-------------|
	| Smaug-Qwen2-72B-Instruct | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
	| Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |


	## Arena-Hard

	Scores compared with selected other models (sourced from the [Arena-Hard leaderboard](https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge)). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

	| Model | Score | 95% Confidence Interval | Average Tokens |
	| :---- | ---------: | ----------: | ------: |
	| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
	| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
	| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
	| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
	| Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
	| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
	| Smaug-Qwen2-72B-Instruct | 48.0 | (-1.8, 2.1) | 628 |
	| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
	| Qwen2-72B-Instruct | 43.5 | (-2.6, 2.7) | 531 |
	| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
	| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
	| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
	| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
	| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
	| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
	| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
	| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
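
The confidence-interval column lists signed offsets around each score. As a small sketch of how to read it, the snippet below converts the offsets for the two Qwen2-based rows into absolute bounds. The `interval` helper is illustrative, and comparing interval bounds is only a rough heuristic, not a formal significance test.

```python
from decimal import Decimal

# (score, lower offset, upper offset) copied from the table above.
arena_hard = {
    "Smaug-Qwen2-72B-Instruct": ("48.0", "-1.8", "2.1"),
    "Qwen2-72B-Instruct": ("43.5", "-2.6", "2.7"),
}


def interval(score, lower_offset, upper_offset):
    """Turn a score and its signed CI offsets into absolute (low, high) bounds."""
    base = Decimal(score)
    return base + Decimal(lower_offset), base + Decimal(upper_offset)


smaug_low, smaug_high = interval(*arena_hard["Smaug-Qwen2-72B-Instruct"])
qwen_low, qwen_high = interval(*arena_hard["Qwen2-72B-Instruct"])

print(f"Smaug-Qwen2-72B-Instruct: [{smaug_low}, {smaug_high}]")
print(f"Qwen2-72B-Instruct:       [{qwen_low}, {qwen_high}]")
print("gap between intervals:", smaug_low - qwen_high)
```

`Decimal` is used instead of floats so that the one-decimal values from the table are added exactly.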

	## MT-Bench

	First turn

	| Model | Turn | Score |
	|--------------------------|------|---------|
	| Qwen2-72B-Instruct | 1 | 9.18125 |
	| Smaug-Qwen2-72B-Instruct | 1 | 9.05625 |

	Second turn

	| Model | Turn | Score |
	|--------------------------|------|---------|
	| Qwen2-72B-Instruct | 2 | 8.74684 |
	| Smaug-Qwen2-72B-Instruct | 2 | 8.67500 |

	Average

	| Model | Score |
	|--------------------------|---------|
	| Qwen2-72B-Instruct | 8.96541 |
	| Smaug-Qwen2-72B-Instruct | 8.86563 |

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_abacusai__Smaug-Qwen2-72B-Instruct)

	| Metric |Value|
	|-------------------|----:|
	|Avg. |41.08|
	|IFEval (0-Shot) |78.25|
	|BBH (3-Shot) |56.27|
	|MATH Lvl 5 (4-Shot)|35.35|
	|GPQA (0-shot) |14.88|
	|MuSR (0-shot) |15.18|
	|MMLU-PRO (5-shot) |46.56|
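
The `Avg.` row is consistent with an unweighted mean of the six benchmark scores. The short check below reproduces it from the values in the table; that the leaderboard computes the average exactly this way is an assumption.

```python
# Benchmark scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 78.25,
    "BBH (3-Shot)": 56.27,
    "MATH Lvl 5 (4-Shot)": 35.35,
    "GPQA (0-shot)": 14.88,
    "MuSR (0-shot)": 15.18,
    "MMLU-PRO (5-shot)": 46.56,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"Avg. = {average:.2f}")
```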