Dongfu Jiang committed
Commit c1046ee
1 Parent(s): 5e3cb95

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -224,8 +224,9 @@ We test the pairwise comparison on
 | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
 | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
 | LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
-| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
-| **PairRM (0.4b)** | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
+| AUTO-J (13b) | 45.8 | 38.9 | **59.2** | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
+| UltraRM (13b) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | **64.17** | 56.25 | 59.85 | **59.85** |
+| **PairRM (0.4b)** | **56.94** | **52.78** | 58.33 | **55.83** | **61.57** | 59.17 | 57.64 | **62.5** | 59.05 |
 
 #### HHH-Alignment and MT-bench human judgements
 
@@ -242,7 +243,8 @@ We test the pairwise comparison on
 | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
 | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
 | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
-| **PairRM (0.4b)** | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
+| UltraRM (13B) | **86.44** | 79.31 | **81.97** | 88.37 | 83.71 | 56 |
+| **PairRM (0.4B)** | 84.75 | 84.48 | 80.33 | **90.7** | **84.62** | **59** |
 | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
 
 **While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
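For context on the pairwise comparison these tables measure, here is a minimal sketch of comparing two candidate responses with PairRM through the llm-blender package. The `Blender`, `loadranker`, and `compare` names follow that package's README; the exact signatures and return format are assumptions here, not something specified in this commit.

```python
# pip install git+https://github.com/yuchenlin/LLM-Blender.git
import llm_blender

# Load the 0.4B PairRM ranker checkpoint from the Hugging Face Hub.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# One instruction and two candidate responses to compare (toy example).
inputs = ["Explain what a reward model is in one sentence."]
candidates_A = ["A reward model scores outputs so RLHF can prefer better responses."]
candidates_B = ["It is a model."]

# compare() judges, per input, whether candidate A is preferred over candidate B.
preferences = blender.compare(inputs, candidates_A, candidates_B)
print(preferences)  # e.g. [True] if A is judged better than B
```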