Commit 504eb7b by yuchenlin (1 parent: a20b0a8)

Update README.md

Files changed (1)
  1. README.md +9 -3
README.md CHANGED
@@ -41,9 +41,15 @@ Apart from that, one can also use PairRM to further align instruction-tuned LLMs
  Unlike the other RMs that encode and score each candidate respectively,
  PairRM takes a pair of candidates and compares them side-by-side to indentify the subtle differences between them.
  Also, PairRM is based on [`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large), and thus it is super efficient: 0.4B.
- We trained PairRM on a diverse collection of human preference datasets such as [`UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), [`HH-RLHF`](https://huggingface.co/datasets/Anthropic/hh-rlhf),
- [`summarize_from_feedback`](https://huggingface.co/datasets/openai/summarize_from_feedback), [`chatbot-arena`](), etc.
- PairRM is part of the LLM-Blender project (ACL 2023). Please see our paper linked above to know more.
+ We trained PairRM on a diverse collection of six human-preference datasets:
+ - [`UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback)
+ - [`HH-RLHF`](https://huggingface.co/datasets/Anthropic/hh-rlhf)
+ - [`summarize_from_feedback`](https://huggingface.co/datasets/openai/summarize_from_feedback)
+ - [`chatbot_arena_conversations`](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
+ - [`webgpt_comparisons`](https://huggingface.co/datasets/openai/webgpt_comparisons)
+ - [`instruct-synthetic-prompt-responses`](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses).
+ -
+ PairRM is part of the LLM-Blender project (ACL 2023). Please see our [paper](https://arxiv.org/abs/2306.02561) above to know more.
  
  
  ## Installation