Tags: Text Generation · Transformers · Safetensors · English · deberta · reward_model · reward-model · RLHF · evaluation · llm · instruction · reranking · Inference Endpoints
Dongfu Jiang committed
Commit 14d4a72
1 Parent(s): a2f8211

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -59,8 +59,8 @@ We test the pairwise comparison on
 | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
 | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
 | LLaMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
-| AUTO-J 1 | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
-| PairRM | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
+| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
+| **PairRM (0.4b)** | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
 
 #### HHH-Alignment and MT-bench human judgements
 
@@ -77,7 +77,7 @@ We test the pairwise comparison on
 | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
 | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
 | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
-| PairRM | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
+| **PairRM (0.4b)** | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
 | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
 
 **While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
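
For context on the pairwise-comparison claim above, here is a minimal sketch of running PairRM through the llm-blender package, following the usage pattern the model card describes; the instruction and candidate strings are made-up examples, and argument names may vary across llm-blender versions:

```python
# pip install llm-blender
import llm_blender

# Load PairRM (the 0.4B DeBERTa-based pairwise ranker) from the Hub.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# Hypothetical instruction with two candidate responses to compare.
inputs = ["Explain what a reward model is in RLHF."]
candidates_A = ["A reward model scores responses so RL can steer the policy toward human preferences."]
candidates_B = ["It is a model."]

# compare() returns, per input, whether candidate A is preferred over B.
results = blender.compare(inputs, candidates_A, candidates_B)
print(results)  # e.g. [True] when A wins the pairwise comparison
```

Because PairRM scores the two candidates jointly rather than independently, a single forward pass per pair suffices, which is what keeps a 0.4B model competitive on these agreement benchmarks.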