Text Generation
Transformers
Safetensors
English
deberta
reward_model
reward-model
RLHF
evaluation
llm
instruction
reranking
Inference Endpoints
yuchenlin committed on
Commit
e066c87
1 Parent(s): 90f9aa4

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -34,13 +34,14 @@ pipeline_tag: text-generation
 
 Pairwise Reward Model (PairRM) takes an instruction and a **pair** of output candidates as the input,
 and outputs a score for each candidate to measure their **relative** quality.
-Unlike other RMs that encode and score each candidate separately,
-PairRM takes a pair of candidates and compares them side-by-side to identify the subtle differences between them.
-
 PairRM can be used to (re-)rank a list of candidate outputs and thus can serve as an LLM evaluator to efficiently assess the quality of LLMs in a local environment.
 PairRM can also be used to enhance decoding via `best-of-n sampling` (i.e., reranking N sampled outputs).
 Apart from that, one can also use PairRM to further align instruction-tuned LLMs with RLHF methods.
 
+Unlike other RMs that encode and score each candidate separately,
+PairRM takes a pair of candidates and compares them side-by-side to identify the subtle differences between them.
+Also, PairRM is based on DeBERTa-large and is thus very efficient (0.4B parameters).
+We trained PairRM on a diverse collection of human preference datasets such as UltraFeedback, HH-RLHF, chatbot-arena, etc.
 PairRM is part of the LLM-Blender project (ACL 2023). Please see our paper linked above to know more.
 
 
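For reference, the reranking / best-of-n use case described in the README can be sketched roughly as below. This is a minimal sketch, assuming the `llm-blender` package and its `Blender.loadranker` / `Blender.rank` interface as described in the LLM-Blender repository; the example instruction and candidates are made up for illustration, and the exact method names and return format should be verified against the project README.

```python
# Minimal best-of-n reranking sketch with PairRM (assumption: llm-blender is
# installed, e.g. `pip install git+https://github.com/yuchenlin/LLM-Blender.git`).
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the pairwise ranker checkpoint

# One instruction with two hypothetical candidate outputs (illustrative only).
inputs = ["Explain the difference between a list and a tuple in Python."]
candidates = [[
    "A list is mutable, while a tuple is immutable.",
    "Both are sequences; there is no difference.",
]]

# rank() returns, per input, a rank for each candidate; in the LLM-Blender
# README example the ranks are 1-indexed, so rank 1 marks the best candidate.
ranks = blender.rank(inputs, candidates)
best = [cands[list(r).index(1)] for cands, r in zip(candidates, ranks)]
print(best[0])
```

The same ranking call is what `best-of-n sampling` amounts to: sample N outputs from an LLM, rank them with PairRM, and keep the top-ranked one.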