|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for RLHFlow/RewardModel-Mistral-7B-for-DPA-v1
|
|
|
A multi-objective reward model that scores assistant responses along 10 attributes from HelpSteer and UltraFeedback, fine-tuned from Mistral-7B-Instruct-v0.2 for Directional Preference Alignment (DPA).
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
|
|
This is a sequence-classification reward model fine-tuned from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). Given a prompt and an assistant response, it outputs a 10-dimensional reward vector covering attributes from the HelpSteer and UltraFeedback datasets (helpfulness, correctness, coherence, complexity, verbosity, overall score, instruction following, truthfulness, and honesty). It accompanies the ACL 2024 paper [Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards](https://arxiv.org/abs/2402.18571).
|
|
|
- **Developed by:** Haoxiang Wang |
|
- **Model type:** Sequence classification (multi-objective reward model)
|
- **Language(s) (NLP):** English |
|
- **License:** Apache-2.0 |
|
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** https://github.com/RLHFlow/directional-preference-alignment |
|
- **Paper (ACL 2024):** [Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards](https://arxiv.org/abs/2402.18571)
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
The model produces a 10-dimensional output, corresponding to the following attributes from HelpSteer and UltraFeedback (in this order):

`helpsteer-helpfulness`, `helpsteer-correctness`, `helpsteer-coherence`, `helpsteer-complexity`, `helpsteer-verbosity`, `ultrafeedback-overall_score`, `ultrafeedback-instruction_following`, `ultrafeedback-truthfulness`, `ultrafeedback-honesty`, `ultrafeedback-helpfulness`
|
|
|
Here is sample code you can try:
|
```python |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
import torch |
|
device = 'cuda' |
|
path = "RLHFlow/RewardModel-Mistral-7B-for-DPA-v1" |
|
rm = AutoModelForSequenceClassification.from_pretrained(path, trust_remote_code=True).to(device) |
|
tokenizer = AutoTokenizer.from_pretrained(path) |
|
|
|
input_template = "[INST] You must read the following conversation carefully and rate the assistant's response from score 0-100 in these aspects: helpfulness, correctness, coherence, honesty, complexity, verbosity\n\nUser: {prompt}\n\nAssistant: {response} [/INST]" |
|
|
|
# Use a sample from HelpSteer validation set |
|
prompt = 'What are some synonyms for the word "beautiful"?' |
|
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant" |
|
|
|
model_inputs = tokenizer(input_template.format(prompt=prompt, response=response), return_tensors="pt").to(device) |
|
with torch.no_grad(): |
|
score = rm(**model_inputs).logits.squeeze().cpu().float().numpy() |
|
|
|
print(score) |
|
# [68.99269 69.62718 76.23071 33.48785 35.853596 63.833366 55.58917 68.7175 59.552124 46.465595] |
|
|
|
# Convert from our scale (0-100) to HelpSteer scale (0-4) |
|
helpsteer_rewards_pred = (score[:5]-10)/20 |
|
print(helpsteer_rewards_pred) |
|
# [2.9496346 2.981359 3.3115356 1.1743925 1.2926798] |
|
# The actual rewards from the HelpSteer dataset for this sample are [3,3,4,2,2] |
|
``` |
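
Because the output is multi-dimensional, you may want to collapse it into a single scalar when ranking candidate responses. Below is a minimal sketch, not part of the official repository, that pairs each score with its attribute name and combines the vector with a user-chosen preference direction in the spirit of DPA; the `weights` values are illustrative assumptions, not recommended settings.

```python
import numpy as np

# Attribute names in the same order as the model's 10 output dimensions
attributes = [
    'helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
    'helpsteer-complexity', 'helpsteer-verbosity',
    'ultrafeedback-overall_score', 'ultrafeedback-instruction_following',
    'ultrafeedback-truthfulness', 'ultrafeedback-honesty',
    'ultrafeedback-helpfulness',
]

# Pair each score (the `score` array from the snippet above) with its name
named_scores = dict(zip(attributes, score))
print(named_scores['helpsteer-helpfulness'])

# Illustrative preference direction: reward helpfulness, mildly penalize verbosity.
# These weights are placeholders; pick your own trade-off.
weights = np.zeros(len(attributes))
weights[attributes.index('helpsteer-helpfulness')] = 1.0
weights[attributes.index('helpsteer-verbosity')] = -0.3
weights /= np.linalg.norm(weights)  # normalize to a unit preference direction

scalar_reward = float(weights @ score)
print(scalar_reward)
```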
|
## Training

The reward model is trained to predict the HelpSteer and UltraFeedback attribute ratings on a common 0-100 scale (see the scale conversion in the example above); full training details are in the paper.

![Preference conflicts between alignment objectives](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/preference-conflict.jpg)

![Illustration of the Directional Preference Alignment algorithm](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/algo-illustration.jpg)
|
|
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
If you find this work useful for your research, please consider citing our paper:
|
``` |
|
@inproceedings{wang2024arithmetic, |
|
title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards}, |
|
author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang}, |
|
year={2024}, |
|
booktitle={ACL}, |
|
} |
|
``` |
|
|
|
## Model Card Authors |
|
|
|
Haoxiang Wang |
|
|
|
## Model Card Contact |
|
|
|
hwang264@illinois.edu |
|
|
|
|
|
|