Text Generation
Transformers
Safetensors
English
deberta
reward_model
reward-model
RLHF
evaluation
llm
instruction
reranking
Inference Endpoints
yuchenlin committed
Commit 447053d
1 Parent(s): c845907

Update README.md

Files changed (1)
  1. README.md +24 -14
README.md CHANGED
@@ -39,17 +39,19 @@ PairRM takes a pair of candidates and compares them side-by-side to identify the better one.
 
  PairRM can be used to (re-)rank a list of candidate outputs and can thus serve as an LLM evaluator to efficiently assess the quality of LLMs in a local environment.
  PairRM can also be used to enhance decoding via `best-of-n sampling` (i.e., reranking N sampled outputs).
- Apart from that, one can also use PairRM to
+ Apart from that, one can also use PairRM to further align instruction-tuned LLMs with RLHF methods.
+
+ PairRM is part of the LLM-Blender project (ACL 2023). Please see our paper linked above to learn more.
 
 
  ## Installation
- Since PairRanker contains some custom layers and tokens. We recommend use PairRM with our llm-blender code API.
+
  - First install `llm-blender`
  ```bash
  pip install git+https://github.com/yuchenlin/LLM-Blender.git
  ```
 
- - Then load pairranker with the following code:
+ - Then load PairRM:
  ```python
  import llm_blender
  blender = llm_blender.Blender()
@@ -59,23 +61,31 @@ blender.loadranker("llm-blender/PairRM") # load PairRM
 
  ## Usage
 
- ### Use case 1: Compare responses (Quality Evaluator)
+ ### Use case 1: Comparing/Ranking output candidates given an instruction
 
- - Then you can rank candidate responses with the following function
+ - Ranking a list of candidate responses
 
  ```python
- inputs = ["input1", "input2"]
- candidates_texts = [["candidate1 for input1", "candidatefor input1"], ["candidate1 for input2", "candidate2 for input2"]]
+ inputs = ["hello!", "I love you!"]
+ candidates_texts = [["get out!", "hi! nice to meet you!", "bye"],
+                     ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
  ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
  # ranks is a list of ranks where ranks[i][j] represents the rank of candidate-j for input-i
+ """
+ ranks -->
+ array([[3, 1, 2],  # it means "hi! nice to meet you!" ranks 1st, "bye" 2nd, and "get out!" 3rd
+        [1, 3, 2]], # it means "I love you too!" ranks 1st, and "I hate you!" 3rd
+       dtype=int32)
+ """
  ```
 
- - Directly compare two candidate responses
+ - Directly comparing two candidate responses
  ```python
- candidates_A = [cands[0] for cands in candidates]
- candidates_B = [cands[1] for cands in candidates]
+ inputs = ["hello!", "I love you!"]
+ candidates_A = ["hi!", "I hate you!"]
+ candidates_B = ["f**k off!", "I love you, too!"]
  comparison_results = blender.compare(inputs, candidates_A, candidates_B)
- # comparison_results is a list of bool, where element[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
+ # comparison_results is a list of bool, where comparison_results[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
+ # comparison_results[0] --> True
  ```
 
  - Directly compare two multi-turn conversations, where the user's query in each turn is fixed and the responses differ.
@@ -86,7 +96,7 @@ conv1 = [
      "role": "USER"
    },
    {
-     "content": "<assistant response>",
+     "content": "<assistant1's response 1>",
      "role": "ASSISTANT"
    },
    ...
@@ -97,7 +107,7 @@ conv2 = [
      "role": "USER"
    },
    {
-     "content": "<assistant response>",
+     "content": "<assistant2's response 1>",
      "role": "ASSISTANT"
    },
    ...
@@ -106,7 +116,7 @@ comparison_results = blender.compare_conversations([conv1], [conv2])
  # comparison_results is a list of bool, where each element denotes whether the responses in conv1, taken together, are better than those in conv2
  ```
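For reference, a complete `compare_conversations` call might look like the sketch below. It is a minimal, illustrative example: the two-turn conversations are invented, and only the API already shown above is used.

```python
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load PairRM

# Two conversations sharing the same user queries but with different responses.
conv1 = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "The capital of France is Paris.", "role": "ASSISTANT"},
    {"content": "And of Germany?", "role": "USER"},
    {"content": "Germany's capital is Berlin.", "role": "ASSISTANT"},
]
conv2 = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "I'm not sure.", "role": "ASSISTANT"},
    {"content": "And of Germany?", "role": "USER"},
    {"content": "No idea, sorry.", "role": "ASSISTANT"},
]

# One bool per conversation pair: True if conv1's responses are judged better overall.
comparison_results = blender.compare_conversations([conv1], [conv2])
print(comparison_results)  # e.g., [True]
```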
 
- ### Use case 2: Best-of-n sampling (Decoding Enhancing)
+ ### Use case 2: Best-of-n Sampling (Decoding Enhancement)
  **Best-of-n Sampling**, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response ranked highest by the reward model (learn more in [OpenAI WebGPT, section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and the [OpenAI blog](https://openai.com/research/measuring-goodharts-law)).
 
  Best-of-n sampling is an easy way to improve your LLM with just a few lines of code. An example of applying it to zephyr is as follows.
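The zephyr example itself is not included in this excerpt. As a rough sketch of the idea (the model name `HuggingFaceH4/zephyr-7b-beta`, the sampling settings, and the rank-1 selection below are illustrative assumptions, not the project's exact code), best-of-n sampling can be built from the `rank` API shown above:

```python
import llm_blender
from transformers import pipeline

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load PairRM

# Assumed base model; any chat-tuned causal LM works the same way.
generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

prompt = "Write a haiku about autumn."
n = 4  # number of candidates to sample

# Sample n diverse candidates (do_sample=True so they differ).
outputs = generator(prompt, do_sample=True, temperature=0.7,
                    max_new_tokens=128, num_return_sequences=n)
candidates = [o["generated_text"][len(prompt):] for o in outputs]

# Rerank the candidates with PairRM and keep the top-ranked one (rank 1 = best).
ranks = blender.rank([prompt], [candidates], return_scores=False, batch_size=1)
best = candidates[list(ranks[0]).index(1)]
print(best)
```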
 
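On the RLHF use mentioned in the introduction, PairRM's judgments can also label preference data. The sketch below assumes that `return_scores=True` (the counterpart of the `return_scores=False` flag used earlier) returns per-candidate scores rather than ranks; treat that flag and the example data as assumptions rather than documented behavior.

```python
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load PairRM

prompts = ["Explain photosynthesis to a child."]
candidates = [["Plants use sunlight to make their own food.",
               "Photosynthesis is the light-driven fixation of CO2."]]

# Assumption: return_scores=True yields per-candidate reward scores
# (higher = preferred) instead of integer ranks.
scores = blender.rank(prompts, candidates, return_scores=True, batch_size=1)

# Turn the scores into a chosen/rejected pair for preference tuning (e.g., DPO),
# or use them directly as a reward signal in an RLHF loop.
chosen = candidates[0][int(scores[0].argmax())]
rejected = candidates[0][int(scores[0].argmin())]
print("chosen:", chosen)
print("rejected:", rejected)
```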