File size: 8,962 Bytes
935856f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
pipeline_tag: text-generation
tags:
- merlinite-pt
- merlinite
- mistral
- ibm
- lab
- labrador
- labradorite
license: apache-2.0
language:
- en
base_model: mistralai/Mistral-7B-v0.1
---


# Model Card for Merlinite-7B-pt 🔥

### Overview
We introduce **Merlinite-7B-pt**, a strong open-source chat model, aligned using AI feedback **without proprietary models or using any human annotation**.
- **Merlinite-7B-pt** is first supervised-finetuned (SFT) via [LAB](https://arxiv.org/abs/2403.01081) using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback.
- Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy.
- We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and MT-Bench improvements. 

The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-7B-Instruct-v0.1, Llama2-70b-chat and comparable to small-sized proprietary models like GPT3.5-Turbo-0314 and Claude-v1. It also exhibits superior instruction-following and human preference compared to the SFT Merlinite-7B model.

### Performance


<!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png)
 -->
| Model | Alignment | Base | Teacher | MTBench* | MMLU(5-shot) | ARC-C(25-shot) | HellaSwag(10-shot) | Winogrande(5-shot) | GSM8K(5-shot- strict) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | RLHF | Llama-2-13b | Human Annotators | 6.65  | 54.58 | 59.81 | 82.52 | 75.93 | 34.80 |
| [Orca-2-13b](https://huggingface.co/microsoft/Orca-2-13b) | Progressive Training | Llama-2-13b | GPT-4 | 6.15  | 60.37 * | 59.73 | 79.86 | 78.22 | 48.22 |
| [WizardLM-13B-V1.2](https://huggingface.co/WizardLM/WizardLM-13B-V1.2) | Evol-Instruct | Llama-2-13b | GPT-4 | 7.20  | 54.83 | 60.24 | 82.62 | 76.40 | 43.75 |
| [Labradorite-13b](https://huggingface.co/ibm/labradorite-13b) | Large-scale Alignment for chatBots (LAB) | Llama-2-13b | Mixtral-8x7B-Instruct | 7.23 | 58.89 | 61.69 | 83.15 | 79.56 | 40.11 |
| [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | SFT | Mistral-7B-v0.1 | - | 6.84 | 60.37 | 63.65  | 84.76 | 76.80 | 41.85 |
| [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | SFT/DPO | Mistral-7B-v0.1 | GPT-4 | 7.34 | 61.07 | 63.74 | 84.19 | 78.06 | 34.04 |
| [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | SFT | Mistral-7B-v0.1 | - | 7.6** | 60.78 | 63.14  | 84.88 | 77.19 | 40.03 |
| Merlinite-7b | Large-scale Alignment for chatBots (LAB) | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct | 7.66 | 64.88 | 63.99 | 84.37 | 78.24 | 44.58 |
| Merlinite-7b-pt | LAB + RLAIF | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct | 7.96 *** | 63.59 | 64.50 | 84.28 | 79.72 | 48.67 |


[*] Numbers for models other than Merlinite-7b, Merlinite-7b-pt  and [Labradorite-13b](https://huggingface.co/ibm/labradorite-13b) (ours) are taken from [lmsys/chatbot-arena-leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)

[**] Numbers taken from [MistralAI Release Blog](https://mistral.ai/news/la-plateforme/)

[***] Merlinite-7b-pt model exhibits variability on the MT-Bench evaluation. The 5-run average score is 7.85, with highest 7.96 and lowest score 7.80.

### Method

<img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png" width="650">

**Above shows MT-Bench score comparisons on 8 prompt domains**


Instead of training preference models or prompting large language models (LLMs) as a judge, we took an alternate approach to reward modeling that uses readily available LLMs and employs log-ratio calculation (DPO reward) as a proxy for reward assessments, as outlined in Lambert (2024) [^1]. 

We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO-policy and the base-policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).

Having Mixtral log-ratio as reward model, we then choose iterative rejection sampling fine-tuning as the RL alignment method. For each prompt, we sample \( N \) times from the current optimal policy (starting from the SFT model). We then query the preference reward and select the highest scoring sample as the target. The initial policy is updated through supervised fine-tuning based on the outputs of rejection sampling. This process is iterated by conducting additional rounds of best-of-N sampling followed by SFT training.

The prompts space for preference tuning were uniformly sampled by source from the [LAB](https://arxiv.org/abs/2403.01081) SFT data distribution, which has extensive coverage in knowledge, domains, and tasks.

[^1]: Lambert, 2024. *RewardBench: Evaluating Reward Models for Language Modeling*.

### Discussion

<img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/Vt0eldYNUW1vOpLBd-_DI.png" width="650">

The preference tuned version of Merlinite-7B-pt shows overall all performance enhancement across the board, with no alignment tax observed, as shown in our evaluation. Surprisingly, we find improvements in mathematical ability measured by GSM8K and MT-Bench, which differs from studies observing decreased math/reasoning after RLHF alignment. 

We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as shown in chart above. The reward score of Best-of-N sampled batch keeps improving til Rejection Sampling Round-2. Model saturates at Rejection sampling round 3, no longer giving improvements on either MT-Bench or Mixtral-DPO rewards. 

The final Merlinite-7B-pt is the peak checkpoint measured by both Batch-Reward and MT-Bench. 


## Model description

- **Language(s):** Primarily English
- **License:** Apache 2.0
- **Base model:** [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Teacher Model:** [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- **Reward Model:** DPO Log-ratio Rewards from [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)

## Prompt Template

```python
sys_prompt = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."

prompt = f'<|system|>\n{sys_prompt}\n<|user|>\n{inputs}\n<|assistant|>\n'
stop_token = '<|endoftext|>'
```

We advise utilizing the system prompt employed during the model's training for optimal inference performance, as there could be performance variations based on the provided instructions. 

## Bias, Risks, and Limitations

The model has been tuned via AI preference. However, this is not a targeted RLHF for model harmlessness. The risks and constraints with respect to model safety remains. The model also maintains the limitations and constraints that arise from the base model. 

The model undergoes training on synthetic data, leading to the potential inheritance of both advantages and limitations from the underlying teacher models and data generation methods. The incorporation of safety measures during Merlinite-7b-pt's training process is considered beneficial. However, a nuanced understanding of the associated risks requires detailed studies for more accurate quantification.

In the absence of adequate safeguards, there exists a risk of malicious utilization of these models for generating disinformation or harmful content. Caution is urged against complete reliance on a specific language model for crucial decisions or impactful information, as preventing these models from fabricating content is not straightforward. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in ungrounded generation scenarios due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.

### Acknowledgements

Guangxuan Xu,  
Project lead.

Akash Srivastava,  
Primary advisor

Kai Xu,  
Advised on evaluation and model training.

Tahira Naseem,  
Advised on DPO rewards.

Abhishek Bhandwaldar,  
Advised on distributed sampling and reward annotation implementation.

Thanks to Luis Lastras, David D. Cox, Ruchir Puri, and Sriram Raghavan for enabling this project and for provisioning the resources.