Tags: Transformers | Safetensors | English | deberta-v2 | reward_model | reward-model | RLHF | evaluation | llm | instruction | reranking | Inference Endpoints
maywell committed
Commit a775416 (1 parent: 4c4f60b)

Update README.md

Files changed (1): README.md (+63 -9)
README.md CHANGED
@@ -3,13 +3,13 @@ license: apache-2.0
  ---
  # Better Implementation for [*PairRM*](https://huggingface.co/llm-blender/PairRM)

- # **Introduction**
+ ## Introduction

  This version of PairRM has some fixes to the training process, which improve the model's performance significantly.

- ## **Minor Fixes**
+ ### Minor Fixes

- ### Longer Context Length (2048 -> 3370)
+ - Longer Context Length (2048 -> 3370)

  Thanks to deberta's tokenizer, the original PairRM model already had enough context length.

@@ -17,9 +17,9 @@ But, the longer the better :>

  ---

- ## **Major Fixes**
+ ### Major Fixes

- ### Change Prompt Format
+ - Change Prompt Format

  Why use something like
  ```
@@ -30,12 +30,66 @@ So, I changed to a format based on Vicuna 1.1.

  ---

- ### Change Truncate side
+ - Change Truncate side

- The original process was using right-side truncation even on the input, which can cause serious problems when the input exceeds the model's seq len.
+ The original process was using right-side truncation even on the input, which can cause serious problems when the input exceeds the model's context length.

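For illustration, the left-truncation fix might look like the following with the Hugging Face `transformers` tokenizer API. This is a minimal sketch, not this repository's training code; the `microsoft/deberta-v3-large` tokenizer and the example strings are stand-ins, and only the `truncation_side` setting and the 3370 budget reflect what this card describes.

```python
from transformers import AutoTokenizer

# PairRM is built on a DeBERTa backbone, so its tokenizer is used as a stand-in here.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
tokenizer.truncation_side = "left"  # default is "right"

source = "A very long multi-turn conversation history ..."
candidate = "A candidate response to be scored."

# With left-side truncation, an overlong input loses tokens from its beginning,
# so the most recent turns and the candidate text survive instead of being cut off.
encoded = tokenizer(
    source,
    candidate,
    truncation=True,
    max_length=3370,  # the extended budget mentioned above
    return_tensors="pt",
)
print(encoded["input_ids"].shape)
```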
  ---

- ### Dataset Filter
+ - Dataset Filter

- There were a decent number of empty assistant responses in the original dataset, so I dropped them.
+ There were a decent number of empty assistant responses in the original dataset, so I dropped them.
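The dataset filter above can be pictured as a simple predicate over the training examples. The field names below (`prompt`, `candidates`) are assumptions for illustration, not the actual schema of the training data:

```python
# Toy examples; "prompt" and "candidates" are assumed field names.
raw_examples = [
    {"prompt": "What is 2 + 2?", "candidates": ["4", "It's 4."]},
    {"prompt": "Summarize the article.", "candidates": ["", "Here is a short summary ..."]},
]

def has_nonempty_candidates(example):
    # Keep an example only if every assistant/candidate response contains visible text.
    return all(resp.strip() for resp in example["candidates"])

filtered = [ex for ex in raw_examples if has_nonempty_candidates(ex)]
print(f"{len(raw_examples)} -> {len(filtered)} examples")  # 2 -> 1
```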
+
+ ---
+
+ ## Statistics
+
+ ### Context length
+ | PairRanker type | Source max length | Candidate max length | Total max length |
+ |:---------------:|:-----------------:|:--------------------:|:----------------:|
+ | [pair-ranker](https://huggingface.co/llm-blender/pair-ranker) | 128 | 128 | 384 |
+ | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) | 1224 | 412 | 2048 |
+ | [Better-PairRM](https://huggingface.co/maywell/Better-PairRM/) (This model) | 2030 | 670 | 3370 |
+
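Each total above appears to be the source budget plus two candidate budgets, matching the pairwise input of one source packed with the two candidates being compared; a quick check:

```python
# Sanity check: total = source + 2 * candidate for every row of the table above.
budgets = {
    "pair-ranker": (128, 128, 384),
    "PairRM": (1224, 412, 2048),
    "Better-PairRM": (2030, 670, 3370),
}
for name, (source, candidate, total) in budgets.items():
    assert source + 2 * candidate == total, name
print("all totals consistent")
```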
+ ### Performance
+
+ #### Reward-Bench by AllenAI
+
+ | Metric | llm-blender/PairRM-hf | maywell/Better-PairRM |
+ |----------------------------|------------------------|------------------------|
+ | model | llm-blender/PairRM-hf | maywell/Better-PairRM |
+ | model_type | Custom Classifier | Custom Classifier |
+ | alpacaeval-length | 0.758 | **0.863** |
+ | alpacaeval-hard | 0.979 | **1.000** |
+ | alpacaeval-easy | 0.970 | **0.990** |
+ | donotanswer | 0.360 | **0.522** |
+ | hep-cpp | 0.628 | **0.646** |
+ | hep-go | 0.689 | **0.713** |
+ | hep-java | 0.628 | **0.713** |
+ | hep-js | 0.604 | **0.707** |
+ | hep-python | 0.646 | **0.713** |
+ | hep-rust | 0.652 | **0.726** |
+ | llmbar-adver-GPTInst | **0.304** | 0.141 |
+ | llmbar-adver-GPTOut | **0.596** | 0.447 |
+ | llmbar-adver-manual | **0.500** | 0.261 |
+ | llmbar-adver-neighbor | **0.433** | 0.276 |
+ | llmbar-natural | **0.800** | 0.720 |
+ | math-prm | **0.333** | 0.295 |
+ | mt-bench-hard | 0.649 | **0.703** |
+ | mt-bench-med | 0.900 | **1.000** |
+ | mt-bench-easy | **0.964** | 0.929 |
+ | refusals-dangerous | 0.080 | **0.730** |
+ | refusals-offensive | 0.010 | **0.940** |
+ | xstest-should-refuse | 0.370 | **0.968** |
+ | xstest-should-respond | **0.952** | 0.876 |
+ | average | 0.600 | **0.690** |
+
+ > *Note - the llmbar test scores are a bit weird across all models on [Reward-Bench](https://huggingface.co/spaces/allenai/reward-bench)*
+
+ ## Thanks to
+
+ - [Sionic AI](https://sionic.ai/) for providing the A100 cluster.
+
+ ## Contact
+
+ - [Discord Server Link](https://discord.gg/MrBt3PXdXc)