hendrydong committed
Commit 1c58e48
Parent(s): 096b3be
Update README.md
README.md CHANGED
@@ -6,8 +6,6 @@ The base model is `meta-llama/Meta-Llama-3-8B-Instruct`.
 
 We use the training script at `https://github.com/WeiXiongUST/RLHF-Reward-Modeling`.
 
-You can also refer to a short blog for RM training details: https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0.
-
 
 ## Uses
 
@@ -54,6 +52,9 @@ This Reward model is the SOTA open-source RM (Apr 20, 2024) on Reward-Bench.
 | Safety | 88.76 |
 | Reasoning | 88.3 |
 
+## See also
+
+You can also refer to our short blog for RM training details: https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0.
 
 ## Reference
 The repo was part of the iterative rejection sampling fine-tuning and iterative DPO. If you find the content of this repo useful in your work, please consider cite it as follows:
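For context on the `## Uses` section this diff touches: below is a minimal, hedged sketch of how a Llama-3-based reward model with a sequence-classification head is typically queried via `transformers`. The model ID is a placeholder (the actual repo ID is not shown in this diff), and the chat-template and scoring details are assumptions rather than this repo's documented usage.

```python
# Sketch: score a prompt/response pair with a Bradley-Terry-style reward model
# (single-logit sequence-classification head). Model ID below is a placeholder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "path/to/llama3-reward-model"  # placeholder, not this repo's real ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16
)

chat = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
# Format the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")

with torch.no_grad():
    # The single logit is the reward score; higher means "preferred".
    reward = model(input_ids).logits[0][0].item()
print(reward)
```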