license: mit
datasets:
  - openai/summarize_from_feedback
  - openai/webgpt_comparisons
  - Dahoas/instruct-synthetic-prompt-responses
  - Anthropic/hh-rlhf
  - lmsys/chatbot_arena_conversations
  - openbmb/UltraFeedback
metrics:
  - accuracy
tags:
  - reward_model
  - reward-model
  - RLHF
  - evaluation
  - llm
  - instruction
  - reranking
language:
  - en
pipeline_tag: text-generation

Pairwise Reward Model for LLMs (PairRM) from LLM-Blender

Introduction

Pairwise Reward Model (PairRM) takes an instruction and a pair of output candidates as input, and outputs a score for each candidate to measure their relative quality. Unlike other RMs that encode and score each candidate separately, PairRM takes a pair of candidates and compares them side by side to identify the subtle differences between them.

PairRM can be used to (re-)rank a list of candidate outputs, and thus can serve as an LLM evaluator to efficiently assess the quality of LLMs in a local environment. PairRM can also be used to enhance decoding via best-of-n sampling (i.e., reranking N sampled outputs). Apart from that, one can also use PairRM to further align instruction-tuned LLMs with RLHF methods.
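To make the step from pairwise comparison to ranking concrete, here is a minimal sketch that turns pairwise preferences into a rank list by counting wins. It is one simple aggregation strategy and not necessarily what `blender.rank` does internally; `rank_by_pairwise_wins` and `toy_prefer_first` are hypothetical names, and the toy comparator (which prefers longer text) only stands in for a learned model.

```python
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(
    candidates: List[str],
    prefer_first: Callable[[str, str], bool],
) -> List[int]:
    """Return 1-based ranks (1 = best) derived from pairwise win counts."""
    wins = [0] * len(candidates)
    # Compare every unordered pair once and credit the winner.
    for i, j in combinations(range(len(candidates)), 2):
        if prefer_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort indices by descending wins; ties keep their original order.
    order = sorted(range(len(candidates)), key=lambda k: -wins[k])
    ranks = [0] * len(candidates)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def toy_prefer_first(a: str, b: str) -> bool:
    # Hypothetical stand-in for a pairwise reward model: prefers longer text.
    return len(a) > len(b)


toy_ranks = rank_by_pairwise_wins(
    ["get out!", "hi! nice to meet you!", "bye"], toy_prefer_first
)
print(toy_ranks)  # [2, 1, 3]
```

Counting wins needs a comparison for every pair, so the cost grows quadratically with the number of candidates.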

PairRM is part of the LLM-Blender project (ACL 2023). Please see the LLM-Blender paper to learn more.

Installation

  • First install llm-blender
pip install git+https://github.com/yuchenlin/LLM-Blender.git
  • Then load PairRM:
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM

Usage

Use case 1: Comparing/Ranking output candidates given an instruction

  • Ranking a list of candidate responses
inputs = ["hello!", "I love you!"]
candidates_texts = [["get out!", "hi! nice to meet you!", "bye"], 
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of ranks where ranks[i][j] represents the rank of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! nice to meet you!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd. 
       [1, 3, 2]], # it means "I love you too!" ranks the 1st, and "I hate you!" ranks the 3rd.
       dtype=int32) 
"""
  • Directly comparing two candidate responses
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# comparison_results[0]--> True 
  • Directly comparing two multi-turn conversations, given that the user's query in each turn is fixed and only the responses differ
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "<assistant1's response 1>",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "<assistant2's response 1>",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether the responses in conv1, taken together, are better than those in conv2
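A rank matrix like the one in the ranking example can be turned into an ordered candidate list with a few lines of plain Python. The sketch below hard-codes the `ranks` values from that example instead of calling the model; `best_candidates` and `sorted_candidates` are illustrative names, not part of the llm-blender API.

```python
inputs = ["hello!", "I love you!"]
candidates_texts = [
    ["get out!", "hi! nice to meet you!", "bye"],
    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"],
]
# Rank values copied from the ranking example (1 = best). Nested lists are used
# here for self-containment; rows of a NumPy array can be indexed the same way.
ranks = [[3, 1, 2], [1, 3, 2]]

best_candidates = []
sorted_candidates = []
for cands, row in zip(candidates_texts, ranks):
    # Candidate indices ordered from best rank to worst rank.
    order = sorted(range(len(cands)), key=lambda j: row[j])
    sorted_candidates.append([cands[j] for j in order])
    best_candidates.append(cands[order[0]])

print(best_candidates)  # ['hi! nice to meet you!', 'I love you too!']
```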

Use case 2: Best-of-n Sampling (Decoding Enhancement)

Best-of-n sampling, also known as rejection sampling, is a strategy to enhance response quality by selecting the candidate that the reward model ranks highest (learn more at OpenAI WebGPT section 3.2 and OpenAI Blog).

Best-of-n sampling is an easy way to improve the output quality of your LLM with just a few lines of code. An example applying it to Zephyr is as follows.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")

inputs = [...] # your list of inputs
system_message = {
    "role": "system",
    "content": "You are a friendly chatbot who always responds in the style of a pirate",
}
messages = [
    [   
        system_message,
        {"role": "user", "content": _input},
    ]
    for _input in inputs  # iterate over the inputs directly; zip(inputs) would yield 1-tuples
]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
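The idea behind best-of-n sampling can be sketched independently of any model: draw n candidates, score them, keep the best. In the sketch below, `best_of_n`, `toy_sample`, and `toy_score` are hypothetical helpers; the per-candidate score is a simplification, since PairRM compares candidates in pairs and `blender.best_of_n_generate` may work differently internally.

```python
from itertools import cycle
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 10,
) -> str:
    """Sample n candidates for a prompt and return the highest-scoring one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins so the sketch runs without a language model or reward model.
_replies = cycle(["Arr.", "Ahoy, matey! Welcome aboard.", "Hello."])


def toy_sample(prompt: str) -> str:
    # Hypothetical sampler: cycles through canned replies.
    return next(_replies)


def toy_score(prompt: str, candidate: str) -> float:
    # Hypothetical reward: longer replies score higher.
    return float(len(candidate))


best = best_of_n("hello!", toy_sample, toy_score, n=3)
print(best)  # Ahoy, matey! Welcome aboard.
```

Larger n gives the reward model more candidates to choose from, at the cost of n generations per prompt.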

Use case 3: RLHF

PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. We believe PairRM will power the alignment of LLMs in an efficient and effective way. With the blender.compare() function, you can easily apply PairRM to popular RLHF toolkits like trl.
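As a rough illustration of how pairwise comparison results could feed a preference-tuning pipeline, the sketch below converts a list of booleans (in the shape returned by `blender.compare`) into chosen/rejected records. The helper `build_preference_pairs` is hypothetical, the `comparison_results` values are assumed for illustration, and the `prompt`/`chosen`/`rejected` layout is an assumption about what a given toolkit expects, so check the toolkit's own documentation.

```python
from typing import Dict, List


def build_preference_pairs(
    prompts: List[str],
    responses_a: List[str],
    responses_b: List[str],
    a_is_better: List[bool],
) -> List[Dict[str, str]]:
    """Turn pairwise verdicts into chosen/rejected preference records."""
    pairs = []
    for prompt, a, b, a_wins in zip(prompts, responses_a, responses_b, a_is_better):
        chosen, rejected = (a, b) if a_wins else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


inputs = ["hello!", "I love you!"]
candidates_a = ["hi!", "I hate you!"]
candidates_b = ["go away!", "I love you, too!"]
# Assumed verdicts for illustration; in practice these would come from
# blender.compare(inputs, candidates_a, candidates_b).
comparison_results = [True, False]

preference_data = build_preference_pairs(
    inputs, candidates_a, candidates_b, comparison_results
)
print(preference_data[1])
# {'prompt': 'I love you!', 'chosen': 'I love you, too!', 'rejected': 'I hate you!'}
```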

🔥 Check more details on our example jupyter notebook usage: blender_usage.ipynb

Learn more in our LLM-Blender Github README.md

Statistics

Context length

| PairRanker type | Source max length | Candidate max length | Total max length |
| --- | --- | --- | --- |
| pair-ranker | 128 | 128 | 384 |
| PairRM (this model) | 1224 | 412 | 2048 |
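The numbers in the table are consistent with a budget of one source plus two candidates (128 + 2 × 128 = 384 and 1224 + 2 × 412 = 2048). The sketch below applies such a budget to token-id lists; `truncate_pair` is a hypothetical helper, and keeping the leading tokens is an assumption, since the table does not say how the model truncates or whether special tokens count toward the total.

```python
from typing import List, Tuple

SOURCE_MAX = 1224     # source max length from the table
CANDIDATE_MAX = 412   # candidate max length from the table
TOTAL_MAX = 2048      # total max length from the table


def truncate_pair(
    source_ids: List[int],
    cand_a_ids: List[int],
    cand_b_ids: List[int],
    source_max: int = SOURCE_MAX,
    candidate_max: int = CANDIDATE_MAX,
) -> Tuple[List[int], List[int], List[int]]:
    """Keep the leading tokens of each segment (assumed truncation side)."""
    return (
        source_ids[:source_max],
        cand_a_ids[:candidate_max],
        cand_b_ids[:candidate_max],
    )


# Dummy token ids: an over-long source, one over-long and one short candidate.
src, a, b = truncate_pair(list(range(3000)), list(range(500)), list(range(100)))
total = len(src) + len(a) + len(b)
print(len(src), len(a), len(b), total)  # 1224 412 100 1736
```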

Performance

PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4.

We test pairwise comparison on the Auto-J pairwise test data, HHH-Alignment, and MT-bench human judgements.

Auto-J Pairwise test data performance

| Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source Models | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source Models | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| Auto-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
| PairRM (0.4b) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |

HHH-Alignment and MT-bench human judgements

| Evaluator LM | HHH Alignment: Help. | HHH Alignment: Harm. | HHH Alignment: Hon. | HHH Alignment: Other | HHH Alignment: Total Avg. | MT Bench Human Judg.: Human Preference |
| --- | --- | --- | --- | --- | --- | --- |
| RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| PairRM (0.4b) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |

Although PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!
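For readers who want to run a similar check on their own data, pairwise agreement can be computed as the percentage of pairs where the evaluator's preference matches the human label. This is a minimal sketch with made-up labels; `pairwise_agreement` is a hypothetical helper, and the benchmarks reported here may treat ties or other label types differently.

```python
from typing import List


def pairwise_agreement(
    model_prefers_a: List[bool], human_prefers_a: List[bool]
) -> float:
    """Percentage of pairs where the model and the human pick the same side."""
    if len(model_prefers_a) != len(human_prefers_a):
        raise ValueError("both label lists must have the same length")
    if not model_prefers_a:
        raise ValueError("at least one comparison is required")
    matches = sum(m == h for m, h in zip(model_prefers_a, human_prefers_a))
    return 100.0 * matches / len(model_prefers_a)


# Made-up labels for illustration only, not data from the benchmarks above.
model_labels = [True, False, True, True]
human_labels = [True, True, True, False]
agreement = pairwise_agreement(model_labels, human_labels)
print(agreement)  # 50.0
```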

We attribute this to two reasons:

  • PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details)
  • The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page)

Citation & Credits

If you use PairRM in your research, please cite LLM-Blender.

@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}