Model Card for Mistral-7B-ReMax-v0.1

The Mistral-7B-ReMax-v0.1 Large Language Model (LLM) is a version of Mistral-7B-Instruct-v0.2 fine-tuned with Reinforcement Learning from Human Feedback (RLHF).

The fine-tuning algorithm is ReMax; see the paper for algorithm details.

Model Details

Uses

Direct Use

The instruction format is the same as Mistral-7B-Instruct-v0.2. Specifically, the prompt should be surrounded by [INST] and [/INST] tokens.

text = (
    "<s>[INST] What is your favourite condiment? [/INST]"
    "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
    "[INST] Do you have mayonnaise recipes? [/INST]"
)
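As a minimal sketch of the format above, the helper below assembles a multi-turn prompt from (user, assistant) pairs. The function name `build_prompt` and its argument layout are illustrative assumptions, not part of the model's tooling; in practice the tokenizer's chat template (for example via `tokenizer.apply_chat_template` in Transformers) would normally do this.

```python
def build_prompt(turns):
    """Format (user, assistant) turns in the [INST] ... [/INST] layout.

    `turns` is a list of (user_message, assistant_reply) pairs. The last
    reply may be None to leave the prompt open for generation.
    This helper is hypothetical and only mirrors the example string above.
    """
    prompt = "<s>"
    for user, assistant in turns:
        # Each user message is wrapped in [INST] ... [/INST].
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            # A completed assistant reply is closed with </s> and a space.
            prompt += f"{assistant}</s> "
    return prompt


if __name__ == "__main__":
    print(build_prompt([
        ("What is your favourite condiment?", "Lemon juice."),
        ("Do you have mayonnaise recipes?", None),
    ]))
```

Passing `None` as the final reply leaves the prompt ending in `[/INST]`, which is where the model is expected to continue.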

Training Details

Training Data

10k prompts from the lmsys-chat-1m dataset. Note that no responses from this dataset are used in training.

Reward Model

The reward model is based on UltraRM-13b.

Important: UltraRM-13b uses a different instruction template from Mistral-7B. To address this, we switch the chat template to that of UltraRM-13b when computing the reward score during RLHF fine-tuning.
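The template switch can be pictured as: strip the policy's `[INST]` markers, then re-wrap the prompt and response in the reward model's format. The sketch below is only an illustration. The template string `ASSUMED_RM_TEMPLATE` and both function names are assumptions made here; the actual template should be taken from the UltraRM-13b model card.

```python
import re

# ASSUMPTION: placeholder template, not confirmed by this model card.
ASSUMED_RM_TEMPLATE = "Human: {prompt}\nAssistant: {response}"


def extract_user_prompt(mistral_prompt):
    """Return the user text from a single-turn '<s>[INST] ... [/INST]' prompt."""
    match = re.fullmatch(
        r"\s*(?:<s>)?\s*\[INST\](.*?)\[/INST\]\s*", mistral_prompt, flags=re.DOTALL
    )
    if match is None or "[INST]" in match.group(1):
        raise ValueError("expected a single-turn [INST] ... [/INST] prompt")
    return match.group(1).strip()


def to_reward_model_text(mistral_prompt, response, template=ASSUMED_RM_TEMPLATE):
    """Re-wrap a policy prompt/response pair in the reward model's template."""
    user_prompt = extract_user_prompt(mistral_prompt)
    # Drop the policy's end-of-sequence marker before scoring.
    clean_response = response.replace("</s>", "").strip()
    return template.format(prompt=user_prompt, response=clean_response)


if __name__ == "__main__":
    print(to_reward_model_text("<s>[INST] Hi there [/INST]", "Hello!</s>"))
```

Keeping the template as a parameter makes it easy to swap in the reward model's real format once it is confirmed.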

Training Procedure

The training algorithm is ReMax. Its details are available in the paper, and its implementation is available in the repository.
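As described in the ReMax paper, the method is a REINFORCE-style update that uses the reward of the greedy response as a baseline for the sampled response. The sketch below shows only that per-prompt weight. Clipping each reward to ±`reward_clip_value` before subtracting is an assumption made for illustration, and the KL penalty is omitted; the repository is the authority on how both are actually applied.

```python
def clip(value, limit):
    """Clamp value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def remax_weight(sampled_reward, greedy_reward, reward_clip_value=1.0):
    """Weight on the log-likelihood gradient of the sampled response.

    ASSUMPTION: rewards are clipped before the baseline is subtracted;
    this ordering is illustrative, not taken from the repository.
    """
    sampled = clip(sampled_reward, reward_clip_value)
    baseline = clip(greedy_reward, reward_clip_value)
    return sampled - baseline


if __name__ == "__main__":
    # A sampled response that beats the greedy one gets a positive weight.
    print(remax_weight(sampled_reward=0.75, greedy_reward=0.25))
```

A positive weight raises the likelihood of the sampled response, and a negative weight lowers it.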

Training Hyperparameters

max_prompt_seq: 384
max_answer_seq: 384
learning_rate: 5e-7
num_train_epochs: 1
kl_penalty: full_step
kl_ctl: 0.1
reward_clip_value: 1.0
temperature for sampling: 0.7
top_p for sampling: 0.9
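To make the two sampling settings concrete, here is a small pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering as it is commonly defined. It is not the training code, and the function name `top_p_distribution` is hypothetical.

```python
import math


def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    print(top_p_distribution([2.0, 1.0, 0.5, -1.0]))
```

Tokens outside the kept set receive zero probability, so sampling never picks from the low-probability tail.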

Evaluation

Chat1M

Since we fine-tune Mistral-7B-Instruct-v0.2 with prompts from the lmsys-chat-1m dataset, we first evaluate response quality on that dataset. We randomly select 500 prompts from lmsys-chat-1m and let both models generate responses. We then have GPT-4 judge each pair to obtain a binary win rate. Mistral-7B-ReMax-v0.1 achieves a win rate of 57.8% against Mistral-7B-Instruct-v0.2.

| Opponent | Mistral-7B-ReMax-v0.1 Wins | Mistral-7B-ReMax-v0.1 Losses |
| --- | --- | --- |
| Mistral-7B-Instruct-v0.2 | 289 | 211 |
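The reported win rate follows directly from these counts. A minimal check, with the helper name `win_rate` chosen here for illustration:

```python
def win_rate(wins, losses):
    """Percentage of pairwise comparisons won."""
    total = wins + losses
    if total == 0:
        raise ValueError("no comparisons")
    return 100.0 * wins / total


if __name__ == "__main__":
    # 289 wins and 211 losses over 500 prompts.
    print(f"{win_rate(289, 211):.1f}%")
```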

Samples of responses are provided below.

==================Instruction================
你好
==================Mistral-7B-Instruct-v0.2================
你好!我是一位帮助您解答问题的机器人。如果您有任何问题,请告诉我,我会尽力回答。

你好(Hello)是中文语言中的一个常用词语,常用于问候或表示亲和,意思是“你好,你很好”。

在我帮助您之前,我想先问一下,您是否知道什么是机器学习?机器学习是人工智能领域中的一种技术,它允许计算机自动学习和改进其行为,以便从数据中提取出有价值的信息,并使计算机能够做出更智能的决策。

如果您已经熟悉机器学习,那么我可以帮助您深入探讨相关主题,如神经网络、支持向量机、决策树等。但是,如果您对机器学习不太熟悉,我可以简单地解释它的基本原理,以及它在现实生活中的应用。

请告诉我您的需求,我会尽力满足您的期望。

再次问候您:你好!
==================Mistral-7B-ReMax-v0.1================
好的,欢迎使用我的帮助!(Hello, welcome to use my help!)

是的,我可以以中文回答你的问题,或者帮你完成一些简单的任务。(Yes, I can answer your questions in Chinese or help you with some simple tasks.)

请问有什么特定的问题需要我帮忙呢?(What specific question do you need my help with?)

或者,你可以告诉我什么方面我可以提供帮助。(Or you can tell me which area I can provide help in.)

希望我能帮到你!(I hope I can help you!)

AlpacaEval

AlpacaEval provides 805 test questions that the model has not been trained on. We provide the evaluation results below:

| Model | AlpacaEval | AlpacaEval 2 | Response Length |
| --- | --- | --- | --- |
| Mistral-7B-Instruct-v0.2 | 92.78 | 14.72 | 1676 |
| Mistral-7B-ReMax-v0.1 | 94.40 | 16.00 | 1478 |
| LLaMa2 Chat 7B | 71.37 | 4.96 | 1479 |
| LLaMa2 Chat 70B | 92.66 | 13.87 | 1790 |
| GPT-4 | 95.28 | 23.58 | 2049 |

Samples of responses are provided below.

==================Instruction================
Marie is at the music store for a store day. She bought two $50 combos. Then, for $25 she bought two t-shirts for $5 each and five hoodies for $2 each. Then, she bought sneakers for $8 each. How much did she spend total?
==================Mistral-7B-Instruct-v0.2================
Marie bought 2 combos at $50 each, so that's $100. She bought 2 t-shirts at $5 each, which is $10. She bought 5 hoodies at $2 each, which is $10. The sneakers cost $8. So Marie spent a total of $118. ($100 + $10 + $8 = $118)
==================Mistral-7B-ReMax-v0.1================
Marie bought two combos at $50 each, so that's $100. She bought two t-shirts for $5 each, which totals $10. She bought five hoodies for $2 each, so that's $10. The sneakers cost $8.

So Marie spent $100 + $10 + $10 + $8 = $128 in total.

Limitations

The Mistral-7B-ReMax-v0.1 model demonstrates that the optimization algorithm ReMax is effective. This model shares the same limitations as the Mistral-7B-Instruct-v0.2 model. In particular, it does not have any moderation mechanisms.

Citation

BibTeX:

@article{li2023remax,
  title={Remax: A simple, effective, and efficient method for aligning large language models},
  author={Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},
  journal={arXiv preprint arXiv:2310.10505},
  year={2023}
}

Model size: 7.24B params (Safetensors)
Tensor type: BF16