	---
	library_name: transformers
	tags: []
	---

	# Model Card for RewardModel-Mistral-7B-for-DPA-v1

	## Model Details

	### Model Description

	This is a multi-objective reward model for Directional Preference Alignment (DPA), fine-tuned from Mistral-7B-Instruct-v0.2. Given a prompt and an assistant response, it outputs scores for 10 attributes drawn from HelpSteer and UltraFeedback.

	- Developed by: Haoxiang Wang
	- Model type: Sequence Classifier
	- Language(s) (NLP): English
	- License: Apache-2.0
	- Finetuned from model: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

	### Model Sources

	- Repository: https://github.com/RLHFlow/directional-preference-alignment
	- Paper: https://arxiv.org/abs/2402.18571

	## How to Get Started with the Model

	Use the code below to get started with the model.

	The model outputs a 10-dimensional score vector. Its entries correspond, in order, to the following attributes from HelpSteer and UltraFeedback:
	['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence', 'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score', 'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness', 'ultrafeedback-honesty', 'ultrafeedback-helpfulness']
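
To make the raw output easier to read, the score vector can be paired with the attribute names in the order listed above. The following is a minimal sketch, not part of the released code: the helper name `label_scores` is hypothetical, and the sample values are copied from the example output printed in this README.

```python
# Attribute names, in the order of the model's 10-dimensional output.
ATTRIBUTES = [
    "helpsteer-helpfulness",
    "helpsteer-correctness",
    "helpsteer-coherence",
    "helpsteer-complexity",
    "helpsteer-verbosity",
    "ultrafeedback-overall_score",
    "ultrafeedback-instruction_following",
    "ultrafeedback-truthfulness",
    "ultrafeedback-honesty",
    "ultrafeedback-helpfulness",
]


def label_scores(score):
    """Pair each entry of a 10-dimensional score vector with its attribute name.

    `label_scores` is a hypothetical helper used for illustration only.
    Accepts any flat iterable of numbers (list, NumPy array, ...).
    """
    values = [float(s) for s in score]
    if len(values) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} scores, got {len(values)}")
    return dict(zip(ATTRIBUTES, values))


# Scores copied from the sample output in this README (0-100 scale).
sample = [68.99269, 69.62718, 76.23071, 33.48785, 35.853596,
          63.833366, 55.58917, 68.7175, 59.552124, 46.465595]

labeled = label_scores(sample)
for name, value in labeled.items():
    print(f"{name}: {value:.2f}")
```

The same helper can be applied directly to the `score` array produced by the sample code, since it only assumes a flat sequence of 10 numbers.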

	The following sample code scores a single prompt-response pair:
	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	import torch
	device = 'cuda'
	path = "RLHFlow/RewardModel-Mistral-7B-for-DPA-v1"
	rm = AutoModelForSequenceClassification.from_pretrained(path, trust_remote_code=True).to(device)
	tokenizer = AutoTokenizer.from_pretrained(path)

	input_template = "[INST] You must read the following conversation carefully and rate the assistant's response from score 0-100 in these aspects: helpfulness, correctness, coherence, honesty, complexity, verbosity\n\nUser: {prompt}\n\nAssistant: {response} [/INST]"

	# Use a sample from HelpSteer validation set
	prompt = 'What are some synonyms for the word "beautiful"?'
	response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"

	model_inputs = tokenizer(input_template.format(prompt=prompt, response=response), return_tensors="pt").to(device)
	# Run inference without tracking gradients
	with torch.no_grad():
	    score = rm(**model_inputs).logits.squeeze().cpu().float().numpy()

	print(score)
	# [68.99269 69.62718 76.23071 33.48785 35.853596 63.833366 55.58917 68.7175 59.552124 46.465595]

	# Convert from our scale (0-100) to HelpSteer scale (0-4)
	helpsteer_rewards_pred = (score[:5]-10)/20
	print(helpsteer_rewards_pred)
	# [2.9496346 2.981359 3.3115356 1.1743925 1.2926798]
	# The actual rewards from the HelpSteer dataset for this sample are [3,3,4,2,2]
	```
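
The rescaling in the sample code is the linear map h = (s - 10) / 20, under which scores of 10 and 90 correspond to HelpSteer ratings of 0 and 4. The sketch below wraps that map and its inverse in two helpers and compares the sample prediction with the dataset labels quoted above. The helper names are hypothetical, and rounding and clipping to the nearest valid rating is an illustrative choice of ours, not something this model card prescribes.

```python
def to_helpsteer_scale(scores):
    """Map 0-100 model scores to the HelpSteer 0-4 scale via (s - 10) / 20."""
    return [(s - 10.0) / 20.0 for s in scores]


def from_helpsteer_scale(ratings):
    """Inverse map: HelpSteer ratings (0-4) to the model's 0-100 scale."""
    return [20.0 * r + 10.0 for r in ratings]


def nearest_rating(value, low=0, high=4):
    """Round to the nearest integer rating and clip to [low, high].

    Illustrative post-processing only (an assumption, not from the model card).
    """
    return max(low, min(high, int(round(value))))


# First five scores from the sample output (the HelpSteer attributes).
helpsteer_scores = [68.99269, 69.62718, 76.23071, 33.48785, 35.853596]
actual = [3, 3, 4, 2, 2]  # labels quoted in the sample code

pred = to_helpsteer_scale(helpsteer_scores)
rounded = [nearest_rating(p) for p in pred]
mean_abs_err = sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

print("predicted:", [round(p, 3) for p in pred])
print("rounded:  ", rounded)
print("mean absolute error:", round(mean_abs_err, 3))
```

Comparing on the continuous scale, as the mean absolute error does here, avoids the information loss that rounding introduces.
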
	## Training

	![image/png](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/preference-conflict.jpg)

	![image/png](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/algo-illustration.jpg)


	## Citation

	If you find this work useful for your research, please consider citing our paper.

	BibTeX:
	```
	@article{wang2024arithmetic,
	title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards},
	author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
	year={2024},
	eprint={2402.18571},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```

	## Model Card Authors

	Haoxiang Wang

	## Model Card Contact

	hwang264@illinois.edu