---
license: llama3
library_name: transformers
datasets:
- aqua_rat
- microsoft/orca-math-word-problems-200k
- m-a-p/CodeFeedback-Filtered-Instruction
model-index:
- name: Smaug-Llama-3-70B-Instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ENEM Challenge (No Images)
      type: eduagarcia/enem_challenge
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 77.89
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BLUEX (No Images)
      type: eduagarcia-temp/BLUEX_without_images
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 69.54
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OAB Exams
      type: eduagarcia/oab_exams
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 63.64
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 RTE
      type: assin2
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 93.62
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 STS
      type: eduagarcia/portuguese_benchmark
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: pearson
      value: 78.52
      name: pearson
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: FaQuAD NLI
      type: ruanchaves/faquad-nli
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 80.01
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HateBR Binary
      type: ruanchaves/hatebr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 91.78
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PT Hate Speech Binary
      type: hate_speech_portuguese
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 68.36
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: tweetSentBR
      type: eduagarcia/tweetsentbr_fewshot
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 70.29
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
---

# Smaug-Llama-3-70B-Instruct

### Built with Meta Llama 3
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/ZxYuHKmU_AtuEJbGtuEBC.png)

This model was built by applying a new Smaug recipe for improving performance on real-world multi-turn conversations to [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct). The model outperforms Llama-3-70B-Instruct substantially, and is on par with GPT-4-Turbo, on MT-Bench (see below).

EDIT: Smaug-Llama-3-70B-Instruct is currently the top open-source model on Arena-Hard! It is also nearly on par with Claude Opus - see below. We are conducting additional benchmark evaluations and will add those when available.

### Model Description

- **Developed by:** [Abacus.AI](https://abacus.ai)
- **License:** https://llama.meta.com/llama3/license/
- **Finetuned from model:** [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)

## How to use

The prompt format is unchanged from Llama 3 70B Instruct.

### Use with transformers

See the snippet below for usage with Transformers:

```python
import transformers
import torch

model_id = "abacusai/Smaug-Llama-3-70B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Build the Llama 3 chat prompt from the message list.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop at either the end-of-sequence token or Llama 3's end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Print only the newly generated text, without the prompt.
print(outputs[0]["generated_text"][len(prompt):])
```
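If you prefer to call the model directly rather than through the `pipeline` helper, the sketch below (not part of the original card) shows an equivalent flow with the standard `AutoModelForCausalLM`/`AutoTokenizer` interfaces, reusing the same chat template, terminators, and sampling settings:

```python
# Minimal sketch: direct generation without the pipeline helper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "abacusai/Smaug-Llama-3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# apply_chat_template builds the Llama 3 prompt and appends the assistant header.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Same stopping criteria as in the pipeline snippet above.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

As in the pipeline snippet, generation stops at either the end-of-sequence token or Llama 3's `<|eot_id|>` end-of-turn token.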
## Evaluation

### Arena-Hard

Scores versus selected other models (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

| Model | Score | 95% Confidence Interval | Average Tokens |
| :---- | ---------: | ----------: | ------: |
| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
| **Smaug-Llama-3-70B-Instruct** | 56.7 | (-2.2, 2.6) | 661 |
| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |

### MT-Bench

```
########## First turn ##########
                                   score
model                      turn
Smaug-Llama-3-70B-Instruct 1     9.40000
GPT-4-Turbo                1     9.37500
Meta-Llama-3-70B-Instruct  1     9.21250

########## Second turn ##########
                                   score
model                      turn
Smaug-Llama-3-70B-Instruct 2     9.0125
GPT-4-Turbo                2     9.0000
Meta-Llama-3-70B-Instruct  2     8.8000

########## Average ##########
                                score
model
Smaug-Llama-3-70B-Instruct   9.206250
GPT-4-Turbo                  9.187500
Meta-Llama-3-70B-Instruct    9.006250
```

| Model | First turn | Second Turn | Average |
| :---- | ---------: | ----------: | ------: |
| **Smaug-Llama-3-70B-Instruct** | 9.40 | 9.01 | 9.21 |
| GPT-4-Turbo | 9.38 | 9.00 | 9.19 |
| Meta-Llama-3-70B-Instruct | 9.21 | 8.80 | 9.01 |

### OpenLLM Leaderboard Manual Evaluation

| Model | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K* | Average |
| :---- | ---: | ------: | ---: | ---: | ---: | ---: | ---: |
| Smaug-Llama-3-70B-Instruct | 70.6 | 86.1 | 79.2 | 62.5 | 83.5 | 90.5 | 78.7 |
| Llama-3-70B-Instruct | 71.4 | 85.7 | 80.0 | 61.8 | 82.9 | 91.1 | 78.8 |

**GSM8K** The GSM8K numbers quoted here are computed using a recent release of the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/) (a rough reproduction sketch is included at the end of this card). The commit used by the leaderboard has a significant issue that impacts models that tend to use `:` in their responses, due to a bug in the stop-word configuration for GSM8K. The issue is covered in more detail in this [GSM8K evaluation discussion](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/770). The scores for both Llama-3 and this model are significantly different when evaluated with the updated harness, as the stop-word issue has been addressed.

This version of Smaug uses new techniques and new data compared to [Smaug-72B](https://huggingface.co/abacusai/Smaug-72B-v0.1), and more information will be released later. For now, see the previous Smaug paper: https://arxiv.org/abs/2402.13228.

# Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/tree/main/abacusai/Smaug-Llama-3-70B-Instruct) and on the [🚀 Open Portuguese LLM Leaderboard](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard).

| Metric                    | Value   |
|---------------------------|---------|
| Average                   |**77.07**|
| ENEM Challenge (No Images)| 77.89   |
| BLUEX (No Images)         | 69.54   |
| OAB Exams                 | 63.64   |
| Assin2 RTE                | 93.62   |
| Assin2 STS                | 78.52   |
| FaQuAD NLI                | 80.01   |
| HateBR Binary             | 91.78   |
| PT Hate Speech Binary     | 68.36   |
| tweetSentBR               | 70.29   |
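For reference, the sketch below (not part of the original card) illustrates how a GSM8K number like the one discussed above could be reproduced with a recent LM Evaluation Harness release. It assumes the harness's Python entry points (`HFLM` and `simple_evaluate`, available in v0.4+); exact argument names may differ between versions.

```python
# Hypothetical reproduction sketch for the GSM8K score, assuming
# lm-evaluation-harness v0.4+; argument names may vary across releases.
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the model with the harness's Hugging Face adapter; parallelize=True
# shards the 70B weights across the available GPUs.
lm = HFLM(
    pretrained="abacusai/Smaug-Llama-3-70B-Instruct",
    dtype="bfloat16",
    parallelize=True,
    batch_size="auto",
)

# 5-shot GSM8K, matching the Open LLM Leaderboard setup.
results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```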