File size: 11,446 Bytes
d09aaea
 
bb45a4c
 
 
 
 
 
 
 
 
 
 
 
 
0305f74
 
 
 
bb45a4c
 
0305f74
d09aaea
bb45a4c
671d616
 
 
bb45a4c
 
7333fb2
bb45a4c
c845907
0305f74
bb45a4c
c845907
 
 
 
 
 
 
447053d
 
 
c845907
0305f74
 
447053d
bb45a4c
 
 
 
 
447053d
bb45a4c
 
 
 
 
 
0305f74
 
 
447053d
bb45a4c
447053d
bb45a4c
 
447053d
 
 
bb45a4c
 
447053d
 
 
 
 
bb45a4c
 
447053d
bb45a4c
447053d
 
 
bb45a4c
447053d
 
bb45a4c
 
 
 
 
 
 
 
 
 
447053d
bb45a4c
 
 
 
 
 
 
 
 
 
447053d
bb45a4c
 
 
 
 
 
 
 
447053d
bb45a4c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0ef6e21
 
bb45a4c
 
 
 
 
 
 
94512ba
 
 
 
 
 
 
 
 
 
 
 
345b1ee
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0305f74
 
 
bb45a4c
 
 
 
 
 
 
 
 
 
0305f74
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
---
license: mit
datasets:
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
metrics:
- accuracy
tags:
- reward_model
- reward-model
- RLHF
- evaluation
- llm
- instruction
- reranking
language:
- en
pipeline_tag: text-generation
---

# Pairwise Reward Model for LLMs (PairRM) from LLM-Blender 


- Github: [https://github.com/yuchenlin/LLM-Blender](https://github.com/yuchenlin/LLM-Blender)
- Paper: [https://arxiv.org/abs/2306.02561](https://arxiv.org/abs/2306.02561)
- Space Demo: [https://huggingface.co/spaces/llm-blender/LLM-Blender](https://huggingface.co/spaces/llm-blender/LLM-Blender)


## Introduction 

Pairwise Reward Model (PairRM) takes an instruction and a **pair** of output candidates as the input, 
and output a score for each candidate to measure their **relative** quality. 
Unlike the other RMs that encode and score each candidate respectively, 
PairRM takes a pair of candidates and compares them side-by-side to indentify the subtle differences between them.

PairRM can be used to (re-)rank a list of candidate outputs and thus can be used an LLM evaluator to efficiently assess the quality of LLMs in local environment.
PairRM can also be used to enhance the decoding by `best-of-n sampling` (i.e., reranking N sampled outputs). 
Apart from that, one can also use PairRM to further align instruction-tuned LLMs with RLHF methods. 

PairRM is part of the LLM-Blender project (ACL 2023). Please see our paper linked above to know more.


## Installation

- First install `llm-blender`
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```


## Usage 

### Use case 1: Comparing/Ranking output candidates given an instruction

- Ranking a list candidate responses

```python
inputs = ["hello!", "I love you!"]
candidates_texts = [["get out!", "hi! nice to meet you!", "bye"], 
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of ranks where ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! nice to meet you!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd. 
       [1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
       dtype=int32) 
```

- Directly comparing two candidate responses
```python
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# comparison_results[0]--> True 
```

- Directly compare two multi-turn conversations given that user's query in each turn are fiexed and responses are different.
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "<assistant1‘s response 1>",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "<assistant2's response 1>",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together is better than that of conv2
```

### Use case 2: Best-of-n Sampling (Decoding Enhancment)
**Best-of-n Sampling**, aka, rejection sampling, is a strategy to enhance the response quality by selecting the one that was ranked highest by the reward model (Learn more at[OpenAI WebGPT section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)). 

Best-of-n sampling is a easy way to imporve your llm power with just a few lines of code. An example of applying on zephyr is as follows.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")

inputs = [...] # your list of inputs
system_message = {
    "role": "system",
    "content": "You are a friendly chatbot who always responds in the style of a pirate",
}
messages = [
    [   
        system_message,
        {"role": "user", "content": _input},
    ]
    for _input in zip(inputs)
]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
```

### Use case 3: RLHF 
PairRM has been trained on various high-quality and large-scale dataset with human preference annotations and exhibits great correlation with human preferences with an extremly small model size (0.4B), approching the performance of GPT-4. 
We believe PairRM will power the alignment of LLM in an efficient and effective way.
With a `blender.compare()` function, you can easily apply PairRM to poopular RLHF toolkits like [trl](https://huggingface.co/docs/trl/index). 

**🔥 Check more details on our example jupyter notebook usage: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**


Learn more in our LLM-Blender Github [README.md](https://github.com/yuchenlin/LLM-Blender#rank-and-fusion)




## Statistics

### Context length
|  PairRanker type  | Source max length | Candidate max length | Total max length |
|:-----------------:|:-----------------:|----------------------|------------------|
| [pair-ranker](https://huggingface.co/llm-blender/pair-ranker)               | 128               | 128                  | 384              |
| [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224              | 412                  | 2048             |


### Performance
PairRM has been trained on various high-quality and large-scale dataset with human preference annotations and exhibits great correlation with human preferences 
with an extremly small model size (0.4B), approching the performance of GPT-4.

We test the pairwise comparison on 
- [Auto-J pairwise testdata](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
- [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)

#### Auto-J Pairwise test data performance

|         Model         |    Summ   |    Exam   |    Code   | Rewriting |   Crea W  |   Func W  |  Comm |    NLP   |  Overall  |
|:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
| Closed -source Models |
|        ChatGPT        |    33.3   |    40.3   |    36.6   |    31.6   |    48.2   |    40.4   |  47.6 |   45.8   |    42.7   |
|       Claude -2       |    30.6   |    36.1   |    41.7   |    34.2   |    48.1   |    42.5   |  40.6 |   48.5   |    42.4   |
|         GPT -4        |    59.7   |    51.4   |    69.2   |    58.3   |    66.7   |    60.4   |  58.3 |   65.2   |    61.9   |
|  Open -source Models  |
|        SteamSHP       |    33.3   |    29.2   |    26.7   |    33.3   |    40.7   |    31.3   |  51.4 |   51.9   |    40.6   |
|        PandaLM        |    29.2   |    33.3   |    31.7   |    23.3   |    43.5   |    32.9   |  44.8 |   48.9   |    38.9   |
|   LLaMA -2-Chat -13B  |    20.8   |    27.8   |    19.2   |     20    |    31.5   |    27.5   |  35.8 |   31.8   |     29    |
|    Vicuna -13B-v1.5   |    30.6   |    23.6   |     35    |    28.3   |    36.1   |    37.5   |  45.5 |   39.8   |    37.3   |
|   WizardLM -13B-v1.2  |    22.2   |    20.8   |    32.5   |    19.2   |    28.7   |    25.4   |  29.2 |    33    |    27.8   |
|   LLAMA -2-chat -70B  |    34.7   |    33.3   |    36.7   |    35.8   |    51.4   |    54.2   |  47.2 |   47.7   |    45.9   |
|       AUTO -J (13b)       |    45.8   |    38.9   |    59.2   |    47.5   |    54.6   |    57.1   |   **58**  |   57.6   |    54.8   |
|         **PairRM (0.4b)**       | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |

#### HHH-Alignment and MT-bench human judgements

|        Evaluator LM       | HHH ALIGNMENT |           |           |          |             | MT BENCH HUMAN JUDG . |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
|                           |     Help .    |   Harm .  |   Hon .   |   Other  | Total Avg . |    Human Preference   |
|           RANDOM          |       50      |     50    |     50    |    50    |      50     |         34.26         |
|  STANFORDNLP REWARD MODEL |     69.49     |   60.34   |   52.46   |   51.16  |    58.82    |         44.79         |
|    ALMOST REWARD MODEL    |     74.58     |   67.24   |   78.69   |   86.05  |    76.02    |          49.9         |
|      LLAMA2 -CHAT 7B      |      66.1     |   81.03   |   70.49   |   74.42  |    72.85    |         51.78         |
|      LLAMA2 -CHAT 13B     |     74.58     |   87.93   |   55.74   |   79.07  |    73.76    |         52.34         |
|      LLAMA2 -CHAT 70B     |      66.1     |   **89.66**   |   67.21   |   74.42  |    74.21    |         53.67         |
| LLAMA2 -CHAT 13B+COARSE . |     68.74     |   68.97   |   65.57   |   67.44  |    67.42    |         46.89         |
|    GPT -3.5-TURBO -0613   |     76.27     |   87.93   |   67.21   |   86.05  |    78.73    |         57.12         |
|       PROMETHEUS 7B       |     69.49     |   84.48   |   78.69   |   90.7   |    80.09    |         55.14         |
|       PROMETHEUS 13B      |     81.36     |   82.76   |   75.41   |   76.74  |    79.19    |         57.72         |
|           **PairRM (0.4b)**          |   **84.75**   |   84.48   | **80.33** | **90.7** |  **84.62**  |         **59**        |
|        GPT -4-0613        |     91.53     |    93.1   |   85.25   |   83.72  |    88.69    |         63.87         |

**While PairRM is a extremely small model (0.4B) based on deberta, the pairwise comparison aggrement performance approches GPT-4's performance!**

Two reasons to attribute:
- Our PairRM specically designed model arch for pairwise comparison through bidirectional attention (See LLM-blender paper for more details)
- The high-quality and large-scale human preference annotation data it was train on (see training dataset list on this hugging face page)






## Citation & Credits 
If you are using PairRM in your research, please cite LLM-blender.
```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}

```