---
license: llama3
---

# Arbitrary-Rating Multi-Objective Reward Model (ArmoRM) with Mixture-of-Experts (MoE) Aggregation of Reward Objectives

**Authors** (* indicates equal contribution)

[Haoxiang Wang*](https://haoxiang-wang.github.io/), [Wei Xiong*](https://weixiongust.github.io/WeiXiongUST/index.html), [Tengyang Xie](https://tengyangxie.github.io/), [Han Zhao](https://hanzhaoml.github.io/), [Tong Zhang](https://tongzhang-ml.org/)

**Blog**: To appear soon (with implementation details)

**Tech Report**: To be released in June 2024

**Model**: [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1)

**Finetuned from model**: [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1)

**Code Repository**: https://github.com/RLHFlow/RLHF-Reward-Modeling/

**Architecture**

<p align="center">
<img width="800" alt="ArmoRM with MoE aggregation of reward objectives" src="https://github.com/RLHFlow/RLHFlow.github.io/blob/main/assets/ArmoRM-MoE.png?raw=true">
</p>
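
To make the diagram above concrete, here is a rough conceptual sketch of the aggregation it illustrates (this is not the released implementation; the module names, shapes, and softmax gating below are illustrative assumptions): a regression head maps the response representation to per-objective rewards, a gating network conditioned on the prompt produces mixing coefficients, and the scalar preference score is their weighted sum. The released model additionally applies a transformation matrix to the rewards before this weighted sum, as shown in the demo code below.

```python
import torch
import torch.nn as nn

class MoERewardAggregator(nn.Module):
    """Conceptual sketch of ArmoRM-style MoE aggregation (names/shapes are illustrative)."""
    def __init__(self, hidden_size: int, num_objectives: int = 19):
        super().__init__()
        # Regression head: response representation -> per-objective rewards
        self.reward_head = nn.Linear(hidden_size, num_objectives)
        # Gating network: prompt representation -> non-negative mixing coefficients
        self.gating = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_objectives),
            nn.Softmax(dim=-1),
        )

    def forward(self, prompt_hidden: torch.Tensor, response_hidden: torch.Tensor):
        rewards = self.reward_head(response_hidden)  # (batch, num_objectives)
        coeffs = self.gating(prompt_hidden)          # (batch, num_objectives)
        # The released model applies a debiasing transform to `rewards` before this step
        score = (rewards * coeffs).sum(dim=-1)       # (batch,) scalar preference score
        return rewards, coeffs, score

# Toy check with random features
agg = MoERewardAggregator(hidden_size=8)
rewards, coeffs, score = agg(torch.randn(2, 8), torch.randn(2, 8))
print(rewards.shape, coeffs.shape, score.shape)  # [2, 19], [2, 19], [2]
```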

## RewardBench Leaderboard

| Model | Base Model | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
|:-----------|:-----------|:-------|:-----:|:-----|:----------|:-------|:----------|:-----------------------|
| ArmoRM-Llama3-8B-v0.1 | Llama-3 8B | ArmoRM + MoE | **88.97** | 96.9 | **76.8** | **92.2** | **97.3** | 74.3 |
| Cohere May 2024 | Unknown | Unknown | 88.25 | 96.4 | 71.3 | **92.7** | **97.7** | **78.2** |
| GPT-4 Turbo (0125 version) | GPT-4 Turbo | LLM-as-a-Judge | 84.25 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9 |
| [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) | Llama-3 8B | Bradley-Terry | 83.61 | **99.4** | 65.1 | 87.8 | 86.4 | 74.9 |
| [Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B) | Yi-34B | Bradley-Terry | 81.44 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4 |

## Demo Code

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(path, device_map=device,
                                                           trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

# We load a random sample from the validation set of the HelpSteer dataset
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
with torch.no_grad():
    output = model(input_ids)
    # Multi-objective rewards for the response
    multi_obj_rewards = output.rewards.cpu().float()
    # The gating layer's output is conditioned on the prompt
    gating_output = output.gating_output.cpu().float()
    # The preference score for the response, aggregated from the
    # multi-objective rewards with the gating layer
    preference_score = output.score.cpu().float()
# We apply a transformation matrix to the multi-objective rewards
# before multiplying with the gating layer's output. This mainly aims
# at reducing the verbosity bias of the original reward objectives
obj_transform = model.reward_transform_matrix.data.cpu().float()
# The final coefficients assigned to each reward objective
multi_obj_coeffs = gating_output @ obj_transform.T
# The preference score is the linear combination of the multi-objective rewards with
# the multi-objective coefficients, which can be verified by the following assertion
assert torch.isclose(torch.sum(multi_obj_rewards * multi_obj_coeffs, dim=1), preference_score, atol=1e-3)
# Find the top-K reward objectives with coefficients of the highest magnitude
K = 3
top_obj_dims = torch.argsort(torch.abs(multi_obj_coeffs), dim=1, descending=True)[:, :K]
top_obj_coeffs = torch.gather(multi_obj_coeffs, dim=1, index=top_obj_dims)

# The attributes of the 19 reward objectives
attributes = ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
              'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
              'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
              'ultrafeedback-honesty', 'ultrafeedback-helpfulness', 'beavertails-is_safe',
              'prometheus-score', 'argilla-overall_quality', 'argilla-judge_lm', 'code-complexity',
              'code-style', 'code-explanation', 'code-instruction-following', 'code-readability']

example_index = 0
for i in range(K):
    attribute = attributes[top_obj_dims[example_index, i].item()]
    coeff = top_obj_coeffs[example_index, i].item()
    print(f"{attribute}: {round(coeff, 5)}")
# code-complexity: 0.19922
# helpsteer-verbosity: -0.10864
# ultrafeedback-instruction_following: 0.07861

# The actual rewards of this example from the HelpSteer dataset
# are [3,3,4,2,2] for the five helpsteer objectives:
# helpfulness, correctness, coherence, complexity, verbosity
# We can linearly transform our predicted rewards to the
# original reward space to compare with the ground truth
helpsteer_rewards_pred = multi_obj_rewards[0, :5] * 5 - 0.5
print(helpsteer_rewards_pred)
# [2.78125 2.859375 3.484375 1.3847656 1.296875 ]
```
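
Beyond inspecting individual objectives, the scalar `output.score` can be used directly to rank candidate responses to the same prompt, e.g. for best-of-n sampling or preference-data annotation. Here is a minimal sketch reusing `model`, `tokenizer`, `device`, and `prompt` from the demo above; the two candidate responses are made-up examples.

```python
# Rank hypothetical candidate responses by the ArmoRM preference score.
# Reuses `model`, `tokenizer`, `device`, and `prompt` from the demo code above.
candidates = [
    "Gorgeous, stunning, lovely, elegant, and pretty are common synonyms.",
    "beautiful",
]
scores = []
for cand in candidates:
    msgs = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": cand}]
    ids = tokenizer.apply_chat_template(msgs, return_tensors="pt").to(device)
    with torch.no_grad():
        scores.append(model(ids).score.cpu().float().item())
best = candidates[scores.index(max(scores))]
print(scores, "->", best)  # the higher-scoring response is preferred
```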

## Citation

If you find this work useful for your research, please consider citing:
```
@misc{wang2024interpretable,
      title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
      author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
      year={2024}
}

@inproceedings{wang2024arithmetic,
      title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards},
      author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
      year={2024},
      booktitle={ACL}
}
```
The second entry, "[Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards](https://arxiv.org/abs/2402.18571)", is another recent work of ours that trained a multi-objective reward model and applied it to LLM alignment; it motivated us to develop the current work.