---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
base_model:
- allenai/Llama-3.1-Tulu-3-8B-DPO
library_name: transformers
---

<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu3/Tulu3-logo.png" alt="Tulu 3 banner" width="800" style="margin-left: auto; margin-right: auto; display: block;"/>

# Llama-3.1-Tulu-3-8B

Tülu3 is a leading instruction-following model family, offering fully open-source data, code, and recipes designed to serve as a comprehensive guide to modern post-training techniques.
Tülu3 is designed for state-of-the-art performance on a diverse range of tasks beyond chat, such as MATH, GSM8K, and IFEval.

## Model description

- **Model type:** A model trained on a mix of publicly available, synthetic, and human-created datasets.
- **Language(s) (NLP):** Primarily English
- **License:** Llama 3.1 Community License Agreement
- **Finetuned from model:** allenai/Llama-3.1-Tulu-3-8B-DPO

### Model Sources

- **Training Repository:** https://github.com/allenai/open-instruct
- **Eval Repository:** https://github.com/allenai/olmes
- **Paper:** https://allenai.org/papers/tulu-3-report.pdf (arXiv soon)
- **Demo:** https://playground.allenai.org/

### Model Family

| **Stage** | **Llama 3.1 8B** | **Llama 3.1 70B** |
|---|---|---|
| **Base Model** | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) |
| **SFT** | [allenai/Llama-3.1-Tulu-3-8B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) | [allenai/Llama-3.1-Tulu-3-70B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-SFT) |
| **DPO** | [allenai/Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) | [allenai/Llama-3.1-Tulu-3-70B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-DPO) |
| **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
| **Reward Model (RM)** | [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |

## Using the model

### Loading with HuggingFace

To load the model with HuggingFace, use the following snippet:
```
from transformers import AutoModelForCausalLM

tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")
```
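
For a quick end-to-end check, the loading snippet can be extended to tokenize a conversation with the built-in chat template and generate a reply. The following is a minimal sketch, not an official recipe: the helper name `generate_reply`, the bfloat16 dtype, and `max_new_tokens=256` are illustrative assumptions. Imports are deferred into the function so that defining it does not load any weights.

```python
MODEL_ID = "allenai/Llama-3.1-Tulu-3-8B"


def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model, apply the tokenizer's chat template, and decode one reply.

    Hypothetical helper for illustration; dtype and generation length are assumptions.
    """
    # Deferred imports: nothing heavy is loaded until the function is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    messages = [{"role": "user", "content": prompt}]
    # add_generation_prompt=True appends the assistant tag so the model answers next.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and keep only the newly generated continuation.
    new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("How are you doing?")` downloads the weights on first use, so it needs enough memory to hold an 8B-parameter model.
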

### vLLM

Because the model is based on Llama, it can be served directly with vLLM:
```
vllm serve allenai/Llama-3.1-Tulu-3-8B
```
Note that given the long chat template of Llama, you may want to use `--max_model_len=8192`.
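
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. The `localhost:8000` address is vLLM's default and an assumption about your deployment, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed address: `vllm serve` listens on port 8000 unless configured otherwise.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload for the served model."""
    return {
        "model": "allenai/Llama-3.1-Tulu-3-8B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def query_vllm(prompt: str, url: str = VLLM_URL) -> str:
    """POST the payload to a running vLLM server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return result["choices"][0]["message"]["content"]
```

The server applies the tokenizer's chat template itself, so the client only sends plain `role`/`content` messages.
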

### Chat template

The chat template for our models is formatted as:
```
<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
Or with new lines expanded:
```
<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
It is embedded within the tokenizer as well, for `tokenizer.apply_chat_template`.
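
To make the format concrete, here is a small pure-Python sketch that renders a list of messages in the layout shown above. It is an approximation for illustration only: the function name is hypothetical, the handling of roles other than `user` and `assistant` (for example a `<|system|>` tag) is an assumption not shown in this card, and the template shipped with the tokenizer remains the authoritative version.

```python
EOS_TOKEN = "<|endoftext|>"


def format_tulu_chat(messages, add_generation_prompt=False):
    """Render role/content messages in the Tulu 3 chat layout (illustrative only)."""
    text = ""
    last_index = len(messages) - 1
    for index, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "assistant":
            # Assistant turns are closed with the end-of-text token.
            text += f"<|assistant|>\n{content}{EOS_TOKEN}"
            if index != last_index:
                text += "\n"
        else:
            # Assumption: every other role uses the same <|role|> tag pattern.
            text += f"<|{role}|>\n{content}\n"
    if add_generation_prompt:
        # Leave the assistant tag open so the model writes the next turn.
        text += "<|assistant|>\n"
    return text
```

For inference, pass `add_generation_prompt=True` so the rendered prompt ends with an open `<|assistant|>` turn.
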

### System prompt

In Ai2 demos, we use this system prompt by default:
```
You are Tulu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI.
```
The model has not been trained with a specific system prompt in mind.
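
If you want the same default in your own code, the prompt can be prepended as a `system` message before applying the chat template. This is a minimal sketch, assuming the conventional `role`/`content` message dictionaries; the helper name is hypothetical.

```python
DEFAULT_SYSTEM_PROMPT = (
    "You are Tulu 3, a helpful and harmless AI Assistant "
    "built by the Allen Institute for AI."
)


def with_system_prompt(messages, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Return a new message list that starts with a system message.

    Conversations that already begin with a system message are returned unchanged.
    """
    if messages and messages[0]["role"] == "system":
        return list(messages)
    return [{"role": "system", "content": system_prompt}, *messages]
```

Because the model was not trained with a particular system prompt, treat this default as optional.
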

### Bias, Risks, and Limitations

The Tülu3 models have limited safety training and, unlike ChatGPT, are not deployed with automatic in-the-loop filtering of responses, so they can produce problematic outputs (especially when prompted to do so).
The size and composition of the corpus used to train the base Llama 3.1 models are also unknown; it likely included a mix of web data and technical sources such as books and code.
See the Falcon 180B model card for an example of this.

## Performance

| Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
|---|---|---|---|---|---|---|---|---|
| **Avg.** | 60.4 | 64.4 | **64.8** | 62.2 | 57.8 | 44.7 | 55.2 | 58.3 |
| **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
| **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
| **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
| **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 62.8 | 21.7 | 0.9 | 2.5 | 56.2 |
| **DROP (3 shot)** | 61.3 | 62.5 | **62.6** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
| **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | **43.7** | 42.5 | 14.8 | 5.1 | 29.8 | 40.0 |
| **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | **87.6** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
| **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
| **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
| **IFEval (prompt loose)** | 72.8 | 81.1 | **82.4** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
| **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
| **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
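
The **Avg.** row is consistent with an unweighted mean of the twelve benchmark rows beneath it. The check below recomputes it for the three Tülu 3 8B columns using the numbers from the table. Treating the average as a plain mean is an inference from the figures, not a documented definition, and the function name is hypothetical.

```python
from statistics import mean

# Scores copied from the 8B table, in row order from MMLU down to Safety.
SCORES = {
    "Tülu 3 SFT 8B": [65.9, 29.3, 46.8, 67.9, 61.3, 31.5, 76.2, 86.2, 81.4, 72.8, 12.4, 93.1],
    "Tülu 3 DPO 8B": [68.7, 29.3, 56.1, 65.8, 62.5, 42.0, 84.3, 83.9, 78.6, 81.1, 33.5, 87.2],
    "Tülu 3 8B": [68.2, 29.1, 55.0, 66.0, 62.6, 43.7, 87.6, 83.9, 79.2, 82.4, 34.5, 85.5],
}


def average_score(scores):
    """Unweighted mean of the benchmark scores, rounded to one decimal place."""
    return round(mean(scores), 1)


for name, scores in SCORES.items():
    print(f"{name}: {average_score(scores)}")
```

The recomputed values (60.4, 64.4, and 64.8) match the **Avg.** row for these three columns.
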

| Benchmark (eval) | Tülu 3 SFT 70B | Tülu 3 DPO 70B | Tülu 3 70B | Llama 3.1 70B Instruct | Qwen 2.5 72B Instruct | Hermes 3 Llama 3.1 70B | Nemotron Llama 3.1 70B |
|---|---|---|---|---|---|---|---|
| **Avg.** | 72.6 | 75.9 | **76.0** | 73.4 | 71.5 | 68.3 | 65.5 |
| **MMLU (0 shot, CoT)** | 78.9 | 83.3 | 83.1 | 85.3 | **85.5** | 80.4 | 83.8 |
| **PopQA (15 shot)** | **48.6** | 46.3 | 46.5 | 46.4 | 30.6 | 48.1 | 36.4 |
| **TruthfulQA (6 shot)** | 55.7 | 67.9 | 67.6 | 66.8 | **69.9** | 66.5 | 62.6 |
| **BigBenchHard (3 shot, CoT)** | **82.7** | 81.8 | 82.0 | 73.8 | 67.2 | 82.1 | 0.7 |
| **DROP (3 shot)** | **77.2** | 74.1 | 74.3 | 77.0 | 34.2 | 73.2 | 68.8 |
| **MATH (4 shot CoT, Flex)** | 53.7 | 62.3 | 63.0 | 56.4 | **74.3** | 41.9 | 55.0 |
| **GSM8K (8 shot, CoT)** | 91.1 | 93.5 | 93.5 | **93.7** | 89.5 | 90.0 | 84.7 |
| **HumanEval (pass@10)** | 92.9 | 92.4 | 92.4 | 93.6 | 94.0 | 89.6 | **94.1** |
| **HumanEval+ (pass@10)** | 87.3 | 88.4 | 88.0 | 89.5 | **90.8** | 85.9 | 85.5 |
| **IFEval (prompt loose)** | 82.1 | 82.6 | 83.2 | **88.0** | 87.6 | 76.0 | 79.9 |
| **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
| **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |

## Hyperparameters

PPO settings for RLVR:
- **Learning Rate**: 3 × 10⁻⁷
- **Discount Factor (gamma)**: 1.0
- **Generalized Advantage Estimation (lambda)**: 0.95
- **Mini-batches (N_mb)**: 1
- **PPO Update Iterations (K)**: 4
- **PPO's Clipping Coefficient (epsilon)**: 0.2
- **Value Function Coefficient (c1)**: 0.1
- **Gradient Norm Threshold**: 1.0
- **Learning Rate Schedule**: Linear
- **Generation Temperature**: 1.0
- **Batch Size (effective)**: 512
- **Max Token Length**: 2,048
- **Max Prompt Token Length**: 2,048
- **Penalty Reward Value for Responses without an EOS Token**: -10.0
- **Response Length**: 1,024 (but 2,048 for MATH)
- **Total Episodes**: 100,000
- **KL penalty coefficient (beta)**: [0.1, 0.05, 0.03, 0.01]
- **Warm up ratio (omega)**: 0.0
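
To show how a few of these settings enter the computation, the sketch below collects a subset in a dataclass and implements textbook Generalized Advantage Estimation with the listed gamma and lambda. The class, field, and function names are hypothetical, and this is not the open-instruct implementation. The KL coefficient is kept as a tuple because the list above gives four values without saying which one produced this checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLVRPPOConfig:
    """Subset of the PPO settings listed above (field names are illustrative)."""
    learning_rate: float = 3e-7
    gamma: float = 1.0
    gae_lambda: float = 0.95
    clip_epsilon: float = 0.2
    value_coef: float = 0.1
    max_grad_norm: float = 1.0
    effective_batch_size: int = 512
    total_episodes: int = 100_000
    kl_beta_values: tuple = (0.1, 0.05, 0.03, 0.01)


def compute_gae(rewards, values, config=RLVRPPOConfig()):
    """Textbook GAE over one finished episode (value after the last step is 0)."""
    advantages = [0.0] * len(rewards)
    next_value, next_advantage = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: reward plus discounted next value, minus the current estimate.
        delta = rewards[t] + config.gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        next_advantage = delta + config.gamma * config.gae_lambda * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```

For example, with a single reward at the end of the response, `compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])` yields advantages that shrink by a factor of lambda for each step back from the final token, since gamma is 1.0 and the intermediate residuals are zero.
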

## License and use

All Llama 3.1 Tülu3 models are released under Meta's [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/).
Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc.
Tülu3 is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).

The models have been fine-tuned using a dataset mix with outputs generated from third-party models and are subject to additional terms:
[Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Qwen License Agreement](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) (models were improved using Qwen 2.5).

## Citation

If Tülu3 or any of the related materials were helpful to your work, please cite:
```
@article{lambert2024tulu3,
  title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and
    Jacob Morrison and
    Valentina Pyatkin and
    Shengyi Huang and
    Hamish Ivison and
    Faeze Brahman and
    Lester James V. Miranda and
    Alisa Liu and
    Nouha Dziri and
    Shane Lyu and
    Yuling Gu and
    Saumya Malik and
    Victoria Graf and
    Jena D. Hwang and
    Jiangjiang Yang and
    Ronan Le Bras and
    Oyvind Tafjord and
    Chris Wilhelm and
    Luca Soldaini and
    Noah A. Smith and
    Yizhong Wang and
    Pradeep Dasigi and
    Hannaneh Hajishirzi
  },
  year = {2024},
  email = {tulu@allenai.org}
}
```