Tags: Text Generation · Transformers · Safetensors · English · deberta · reward_model · reward-model · RLHF · evaluation · llm · instruction · reranking · Inference Endpoints
yuchenlin committed
Commit 90f9aa4 · 1 parent: 447053d

Update README.md

Files changed (1): README.md (+40 −24)
README.md CHANGED
@@ -61,7 +61,7 @@ blender.loadranker("llm-blender/PairRM") # load PairRM
 
 ## Usage
 
- ### Use case 1: Comparing/Ranking output candidates given an instruction
+ ### Use Case 1: Comparing/Ranking output candidates given an instruction
 
 - Ranking a list of candidate responses
 
@@ -88,7 +88,8 @@ comparison_results = blender.compare(inputs, candidates_A, candidates_B)
 # comparison_results[0] --> True
 ```
 
- - Directly compare two multi-turn conversations given that the user's query in each turn is fixed and the responses are different.
+ <details><summary>Comparing two multi-turn conversations.</summary>
+
 ```python
 conv1 = [
     {
@@ -96,7 +97,7 @@ conv1 = [
         "role": "USER"
     },
     {
-         "content": "<assistant1's response 1>",
+         "content": "[assistant1's response 1]",
         "role": "ASSISTANT"
     },
     ...
@@ -107,7 +108,7 @@ conv2 = [
         "role": "USER"
     },
     {
-         "content": "<assistant2's response 1>",
+         "content": "[assistant2's response 1]",
         "role": "ASSISTANT"
     },
     ...
@@ -115,36 +116,51 @@ conv2 = [
 comparison_results = blender.compare_conversations([conv1], [conv2])
 # comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those in conv2
 ```
+ </details>
 
- ### Use case 2: Best-of-n Sampling (Decoding Enhancement)
- **Best-of-n Sampling**, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response that was ranked highest by the reward model (learn more in [OpenAI WebGPT section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and the [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)).
+
+ ### Use Case 2: Best-of-n Sampling (Decoding Enhancement)
 
- Best-of-n sampling is an easy way to improve your LLM with just a few lines of code. An example of applying it to zephyr is as follows.
+ **Best-of-n Sampling**, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response that the reward model ranks highest
+ (see [OpenAI WebGPT, section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and the [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)).
+ Best-of-n sampling with PairRM is an easy way to improve your LLMs with only a few changes to your inference code:
 
 ```python
+ # loading models
+ import llm_blender
 from transformers import AutoTokenizer, AutoModelForCausalLM
-
 tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
 model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
+ system_message = {"role": "system", "content": "You are a friendly chatbot."}
 
- inputs = [...] # your list of inputs
- system_message = {
-     "role": "system",
-     "content": "You are a friendly chatbot who always responds in the style of a pirate",
- }
- messages = [
-     [
-         system_message,
-         {"role": "user", "content": _input},
-     ]
-     for _input in zip(inputs)
- ]
+ # formatting your inputs
+ inputs = ["can you tell me a joke about OpenAI?"]
+ messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
 prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
+
+ # Conventional generation method
+ input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
+ sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
+ print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
+ # --> The output can be a bad case, e.g., a very short reply such as `Sure`
+
+ # PairRM for best-of-n sampling
+ blender = llm_blender.Blender()
+ blender.loadranker("llm-blender/PairRM")  # load ranker checkpoint
 outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
- print("### Prompt:")
- print(prompts[0])
- print("### best-of-n generations:")
- print(outputs[0])
+
+ print("### Prompt:\n", prompts[0])
+ print("### best-of-n generations:\n", outputs[0])
+ # --> The output will be much more stable and consistently better than a single sample, for example:
+ """
+ Sure, here's a joke about OpenAI:
+
+ Why did OpenAI decide to hire a mime as their new AI researcher?
+
+ Because they wanted someone who could communicate complex ideas without making a sound!
+
+ (Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
+ """
 ```
 
 ### Use case 3: RLHF
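
The comparison snippet in hunk `@@ -88,7 +88,8 @@` is truncated at the hunk boundary, so only its last lines are visible above. A minimal self-contained version of that use case might look like the sketch below; it relies only on the `Blender`, `loadranker`, and `compare` calls that already appear in this diff, while the instruction and candidate strings are made-up examples, not part of the commit.

```python
# Minimal sketch of Use Case 1 (pairwise comparison); the example strings are illustrative.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load PairRM, as in the README above

inputs = ["What is the capital of France?"]           # hypothetical instruction
candidates_A = ["The capital of France is Paris."]    # hypothetical response A
candidates_B = ["I am not sure, maybe Lyon?"]         # hypothetical response B

comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results[0] --> True if candidates_A[0] is judged better than candidates_B[0]
```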
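The `### Use case 3: RLHF` section is likewise cut off by the hunk boundary. As one hedged illustration (not necessarily the recipe the README describes), PairRM's pairwise judgments could be turned into chosen/rejected preference pairs for preference-based fine-tuning; in the sketch below the prompts, responses, and pairing loop are assumptions, and only `blender.compare` comes from the API shown above.

```python
# Hedged sketch: build preference pairs from PairRM comparisons (illustrative data and logic).
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompts = ["Explain photosynthesis in one sentence."]                           # hypothetical
responses_A = ["Plants turn sunlight, water, and CO2 into sugars and oxygen."]  # hypothetical
responses_B = ["It's just something plants do."]                                 # hypothetical

a_is_better = blender.compare(prompts, responses_A, responses_B)  # list of bool, per the README

preference_pairs = []
for prompt, resp_a, resp_b, a_wins in zip(prompts, responses_A, responses_B, a_is_better):
    chosen, rejected = (resp_a, resp_b) if a_wins else (resp_b, resp_a)
    preference_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
# preference_pairs can then be fed to a preference-optimization trainer of your choice
```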