Model Card for Merlinite-7B-pt 🔥

Overview

We introduce Merlinite-7B-pt, a strong open-source chat model aligned using AI feedback, without proprietary models or any human annotation.

Merlinite-7B-pt is first supervised fine-tuned (SFT) via LAB, using Mistral-7B-v0.1 as the base model, and then preference-tuned via AI feedback.
Our preference-tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as a proxy for human preferences and applies iterative rejection sampling to fine-tune the SFT policy.
We show that DPO log-ratios can serve as a reliable reward signal: reward improvements correlate clearly with MT-Bench improvements.

The official Merlinite-7B-pt achieves 7.96 on MT-Bench, surpassing Mistral-7B-Instruct-v0.1 and Llama2-70b-chat, and is comparable to small proprietary models such as GPT3.5-Turbo-0314 and Claude-v1. It also shows better instruction-following and better alignment with human preference than the SFT Merlinite-7B model.

Performance

Model	Alignment	Base	Teacher	MT-Bench*	MMLU(5-shot)	ARC-C(25-shot)	HellaSwag(10-shot)	Winogrande(5-shot)	GSM8K(5-shot, strict)
Llama-2-13b-chat-hf	RLHF	Llama-2-13b	Human Annotators	6.65	54.58	59.81	82.52	75.93	34.80
Orca-2-13b	Progressive Training	Llama-2-13b	GPT-4	6.15	60.37 *	59.73	79.86	78.22	48.22
WizardLM-13B-V1.2	Evol-Instruct	Llama-2-13b	GPT-4	7.20	54.83	60.24	82.62	76.40	43.75
Labradorite-13b	Large-scale Alignment for chatBots (LAB)	Llama-2-13b	Mixtral-8x7B-Instruct	7.23	58.89	61.69	83.15	79.56	40.11
Mistral-7B-Instruct-v0.1	SFT	Mistral-7B-v0.1	-	6.84	60.37	63.65	84.76	76.80	41.85
zephyr-7b-beta	SFT/DPO	Mistral-7B-v0.1	GPT-4	7.34	61.07	63.74	84.19	78.06	34.04
Mistral-7B-Instruct-v0.2	SFT	Mistral-7B-v0.1	-	7.6**	60.78	63.14	84.88	77.19	40.03
Merlinite-7b	Large-scale Alignment for chatBots (LAB)	Mistral-7B-v0.1	Mixtral-8x7B-Instruct	7.66	64.88	63.99	84.37	78.24	44.58
Merlinite-7b-pt	LAB + RLAIF	Mistral-7B-v0.1	Mixtral-8x7B-Instruct	7.96 ***	63.59	64.50	84.28	79.72	48.67

[*] Numbers for models other than Merlinite-7b, Merlinite-7b-pt and Labradorite-13b (ours) are taken from lmsys/chatbot-arena-leaderboard

[**] Numbers taken from MistralAI Release Blog

[***] The Merlinite-7b-pt model exhibits variability on the MT-Bench evaluation. The 5-run average score is 7.85, with a highest score of 7.96 and a lowest score of 7.80.

Method

Figure: MT-Bench score comparisons across 8 prompt domains.

Instead of training preference models or prompting large language models (LLMs) as judges, we took an alternative approach to reward modeling: we use readily available LLMs and compute a log-ratio (the DPO reward) as a proxy for reward assessment, as outlined in Lambert (2024) [^1].

We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO-policy and the base-policy, it nevertheless yields strong performance, with an average score of 74.7 on the RewardBench leaderboard.
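The log-ratio reward described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the per-token log-probabilities of a response have already been obtained from the instruct model and from the base model over the same response tokens. The function names and the `beta` scaling factor are assumptions introduced here, not taken from the card.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities of a response into log p(y | x)."""
    return float(sum(token_logprobs))


def dpo_log_ratio_reward(
    instruct_token_logprobs: Sequence[float],
    base_token_logprobs: Sequence[float],
    beta: float = 1.0,  # hypothetical scaling factor; the card does not specify one
) -> float:
    """Implicit DPO reward: beta * (log pi_instruct(y|x) - log pi_base(y|x))."""
    if len(instruct_token_logprobs) != len(base_token_logprobs):
        raise ValueError("both models must score the same response tokens")
    return beta * (
        sequence_logprob(instruct_token_logprobs)
        - sequence_logprob(base_token_logprobs)
    )


# Toy numbers for illustration only; these are not real model outputs.
reward = dpo_log_ratio_reward([-0.5, -1.0, -0.25], [-1.0, -1.5, -0.5])
print(reward)
```

A response that the instruct model finds more likely than the base model does receives a positive reward, so the reward can rank candidate responses to the same prompt.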

With the Mixtral log-ratio as the reward model, we then choose iterative rejection sampling fine-tuning as the RL alignment method. For each prompt, we sample N times from the current best policy (starting from the SFT model). We then query the preference reward and select the highest-scoring sample as the target. The initial policy is updated through supervised fine-tuning on the outputs of rejection sampling. This process is iterated by conducting additional rounds of best-of-N sampling followed by SFT training.
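The loop described above might look like the following sketch. All names (`make_sampler`, `reward_fn`, `sft_fn`, and the toy stand-ins) are hypothetical placeholders for the real sampling, reward, and training code. The sketch also assumes that each round fine-tunes the policy produced by the previous round, which the description does not state explicitly.

```python
import itertools
from typing import Callable, List, Sequence, Tuple, TypeVar

Policy = TypeVar("Policy")
Sampler = Callable[[str], str]          # prompt -> one sampled response
RewardFn = Callable[[str, str], float]  # (prompt, response) -> reward
Pair = Tuple[str, str]                  # (prompt, selected response)


def best_of_n(prompt: str, sample_fn: Sampler, reward_fn: RewardFn, n: int) -> str:
    """Draw n responses and keep the one with the highest reward."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def iterative_rejection_sampling(
    prompts: Sequence[str],
    policy: Policy,
    make_sampler: Callable[[Policy], Sampler],
    reward_fn: RewardFn,
    sft_fn: Callable[[Policy, List[Pair]], Policy],
    n: int,
    rounds: int,
) -> Policy:
    """Alternate best-of-N target selection with supervised fine-tuning."""
    for _ in range(rounds):
        sample_fn = make_sampler(policy)
        targets = [(p, best_of_n(p, sample_fn, reward_fn, n)) for p in prompts]
        policy = sft_fn(policy, targets)
    return policy


# Toy stand-ins so the sketch runs end to end: the "policy" is a single number,
# responses are numbers rendered as strings, and the reward is the number itself.
def toy_make_sampler(policy: float) -> Sampler:
    offsets = itertools.cycle([0.0, 1.0, 2.0])
    return lambda prompt: str(policy + next(offsets))


def toy_reward(prompt: str, response: str) -> float:
    return float(response)


def toy_sft(policy: float, targets: List[Pair]) -> float:
    # "Training" moves the policy to the mean of the selected targets.
    return sum(float(response) for _, response in targets) / len(targets)


final_policy = iterative_rejection_sampling(
    ["p1", "p2"], 0.0, toy_make_sampler, toy_reward, toy_sft, n=3, rounds=2
)
print(final_policy)
```

In a real setup, `reward_fn` would be the Mixtral log-ratio reward and `sft_fn` a full fine-tuning run. The toy versions only demonstrate the control flow.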

The prompt space for preference tuning was uniformly sampled by source from the LAB SFT data distribution, which has extensive coverage of knowledge, domains, and tasks.
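One possible reading of "uniformly sampled by source" is that each data source is equally likely to contribute a prompt, regardless of how many prompts it contains. The sketch below illustrates that reading only. The function name, the dictionary layout, and sampling with replacement are assumptions, and the actual procedure may differ.

```python
import random
from typing import Dict, List, Sequence


def sample_prompts_uniform_by_source(
    prompts_by_source: Dict[str, Sequence[str]],
    k: int,
    seed: int = 0,
) -> List[str]:
    """Pick a source uniformly at random, then a prompt uniformly within it."""
    rng = random.Random(seed)
    # Ignore empty sources so that every draw can return a prompt.
    sources = sorted(s for s, prompts in prompts_by_source.items() if prompts)
    if not sources:
        raise ValueError("no prompts available")
    picked = []
    for _ in range(k):
        source = rng.choice(sources)
        picked.append(rng.choice(prompts_by_source[source]))
    return picked


# Hypothetical source names and prompts, for illustration only.
pool = {
    "knowledge": ["k1", "k2", "k3", "k4"],
    "skills": ["s1"],
    "empty_source": [],
}
batch = sample_prompts_uniform_by_source(pool, k=6)
print(batch)
```

Under this reading, a small source such as `skills` is drawn about as often as a large one, which keeps large sources from dominating the preference-tuning prompts.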

[^1]: Lambert, 2024. RewardBench: Evaluating Reward Models for Language Modeling.

Discussion

The preference-tuned Merlinite-7B-pt shows an overall performance improvement, with no alignment tax observed in our evaluation. Surprisingly, we find improvements in mathematical ability as measured by GSM8K and MT-Bench, in contrast to studies that observe decreased math/reasoning ability after RLHF alignment.

We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as shown in the chart above. The reward score of the best-of-N sampled batch keeps improving until rejection sampling round 2. The model saturates at rejection sampling round 3, with no further improvement on either MT-Bench or Mixtral DPO rewards.

The final Merlinite-7B-pt is the peak checkpoint as measured by both batch reward and MT-Bench.

Model description

Language(s): Primarily English
License: Apache 2.0
Base model: mistralai/Mistral-7B-v0.1
Teacher Model: mistralai/Mixtral-8x7B-Instruct-v0.1
Reward Model: DPO Log-ratio Rewards from mistralai/Mixtral-8x7B-Instruct-v0.1

Prompt Template

sys_prompt = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."

prompt = f'<|system|>\n{sys_prompt}\n<|user|>\n{inputs}\n<|assistant|>\n'
stop_token = '<|endoftext|>'

We advise using the system prompt employed during the model's training for optimal inference performance, as performance may vary with the instructions provided.
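As a small usage sketch, the template above can be wrapped in helper functions. The helper names `build_prompt` and `truncate_at_stop` are illustrative and not part of any released API. The system prompt and stop token are copied from the template.

```python
SYS_PROMPT = (
    "You are an AI language model developed by IBM Research. "
    "You are a cautious assistant. You carefully follow instructions. "
    "You are helpful and harmless and you follow ethical guidelines "
    "and promote positive behavior."
)
STOP_TOKEN = "<|endoftext|>"


def build_prompt(inputs: str, sys_prompt: str = SYS_PROMPT) -> str:
    """Format a user message with the chat template from this card."""
    return f"<|system|>\n{sys_prompt}\n<|user|>\n{inputs}\n<|assistant|>\n"


def truncate_at_stop(generated: str, stop_token: str = STOP_TOKEN) -> str:
    """Keep only the text generated before the first stop token."""
    return generated.split(stop_token, 1)[0]


prompt = build_prompt("What is the capital of France?")
print(prompt)
```

The string returned by `build_prompt` is what would be passed to the model. `truncate_at_stop` can be applied to raw generations when the decoding setup does not stop at the stop token on its own.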

Bias, Risks, and Limitations

The model has been tuned via AI preference. However, this is not RLHF targeted at model harmlessness, so the risks and constraints with respect to model safety remain. The model also retains the limitations and constraints that arise from the base model.

The model is trained on synthetic data, so it may inherit both the advantages and the limitations of the underlying teacher models and data generation methods. The incorporation of safety measures during Merlinite-7b-pt's training process is considered beneficial. However, a nuanced understanding of the associated risks requires detailed studies for more accurate quantification.

In the absence of adequate safeguards, there exists a risk of malicious utilization of these models for generating disinformation or harmful content. Caution is urged against complete reliance on a specific language model for crucial decisions or impactful information, as preventing these models from fabricating content is not straightforward. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in ungrounded generation scenarios due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.

Acknowledgements

Guangxuan Xu,
Project lead.

Akash Srivastava,
Primary advisor.

Kai Xu,
Advised on evaluation and model training.

Tahira Naseem,
Advised on DPO rewards.

Abhishek Bhandwaldar,
Advised on distributed sampling and reward annotation implementation.

Thanks to Luis Lastras, David D. Cox, Ruchir Puri, and Sriram Raghavan for enabling this project and for provisioning the resources.
