---
library_name: transformers
tags: []
---

# Model Card for RLHFlow/RewardModel-Mistral-7B-for-DPA-v1

<!-- Provide a quick summary of what the model is/does. -->

A multi-objective reward model for Directional Preference Alignment (DPA), fine-tuned from Mistral-7B-Instruct-v0.2. Given a prompt and a response, it scores the response along 10 attributes drawn from HelpSteer and UltraFeedback.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- **Developed by:** Haoxiang Wang
- **Model type:** Sequence classifier (multi-objective reward model)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/RLHFlow/directional-preference-alignment
- **Paper (ACL 2024):** [Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards](https://arxiv.org/abs/2402.18571)

## How to Get Started with the Model

Use the code below to get started with the model.

The model produces a 10-dimensional output corresponding to the following attributes from HelpSteer and UltraFeedback:

`helpsteer-helpfulness`, `helpsteer-correctness`, `helpsteer-coherence`, `helpsteer-complexity`, `helpsteer-verbosity`, `ultrafeedback-overall_score`, `ultrafeedback-instruction_following`, `ultrafeedback-truthfulness`, `ultrafeedback-honesty`, `ultrafeedback-helpfulness`

Here is sample code you can try:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/RewardModel-Mistral-7B-for-DPA-v1"
# trust_remote_code=True loads the custom modeling code shipped with the model repo
rm = AutoModelForSequenceClassification.from_pretrained(path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(path)

input_template = "[INST] You must read the following conversation carefully and rate the assistant's response from score 0-100 in these aspects: helpfulness, correctness, coherence, honesty, complexity, verbosity\n\nUser: {prompt}\n\nAssistant: {response} [/INST]"

# Use a sample from HelpSteer validation set
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"

model_inputs = tokenizer(input_template.format(prompt=prompt, response=response), return_tensors="pt").to(device)
with torch.no_grad():
    score = rm(**model_inputs).logits.squeeze().cpu().float().numpy()

print(score)
# [68.99269  69.62718  76.23071  33.48785  35.853596 63.833366 55.58917 68.7175 59.552124 46.465595]

# Convert from our scale (0-100) to HelpSteer scale (0-4) 
helpsteer_rewards_pred = (score[:5]-10)/20
print(helpsteer_rewards_pred)
# [2.9496346 2.981359  3.3115356 1.1743925 1.2926798]
# The actual rewards from the HelpSteer dataset for this sample are [3,3,4,2,2]
```
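
For readability, the 10 scores can be paired with their attribute names (in the order listed above). A minimal sketch that reuses the `score` array from the snippet above:

```python
# Attribute names in the same order as the model's 10 output dimensions.
attributes = [
    'helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
    'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
    'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
    'ultrafeedback-honesty', 'ultrafeedback-helpfulness',
]
for name, value in zip(attributes, score):
    print(f"{name}: {value:.2f}")
```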
## Training

![Preference conflict](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/preference-conflict.jpg)

![Illustration of the DPA algorithm](https://github.com/RLHFlow/directional-preference-alignment/raw/main/assets/algo-illustration.jpg)
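
In DPA, a user preference is represented as a direction (unit vector) in the multi-objective reward space, and the scalar reward is the dot product of that direction with the reward vector. The sketch below is illustrative only, assuming a hypothetical direction over the helpfulness and verbosity dimensions and reusing the `score` array from the example above; see the paper for the exact dimensions and weighting used in training.

```python
import numpy as np

# Hypothetical user preference direction over two reward dimensions
# (helpfulness, verbosity), normalized to a unit vector.
v = np.array([0.95, 0.05])
v = v / np.linalg.norm(v)

# Reuse the 10-dimensional `score` from the earlier snippet:
# index 0 = helpsteer-helpfulness, index 4 = helpsteer-verbosity.
selected = np.array([score[0], score[4]])

# Scalar (directional) reward: dot product of the preference direction
# with the selected multi-objective rewards.
scalar_reward = float(v @ selected)
print(scalar_reward)
```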


## Citation

**BibTeX:**
If you find this work useful for your research, please consider citing our paper:
```
@inproceedings{wang2024arithmetic,
      title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards}, 
      author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
      year={2024},
      booktitle={ACL},
}
```

## Model Card Authors

Haoxiang Wang

## Model Card Contact

hwang264@illinois.edu