weqweasdas committed 5015d45 (parent: 23cf908): Update README.md

README.md CHANGED
@@ -8,7 +8,7 @@
 
 <!-- Provide a quick summary of what the model is/does. -->
 
-In this repo, we present a reward model trained by the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model
+In this repo, we present a reward model trained with the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model is trained on the [HH-RLHF dataset](Dahoas/full-hh-rlhf), starting from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
 
 ## Model Details
 
@@ -65,6 +65,13 @@ We use bf16 and do not use LoRA in both of the stages.
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
+### RAFT Example
+
+We test the reward model with [RAFT (Reward rAnked FineTuning)](https://arxiv.org/pdf/2304.06767.pdf), using EleutherAI/gpt-neo-2.7B as the starting checkpoint.
+
+In each iteration, we sample 2048 prompts from the HH-RLHF dataset; for each prompt, we generate K=8 responses with the current model and keep the one with the highest reward. We then finetune the model on this selected set to obtain the next model. The learning curve is shown below:
+
+![Reward Curve of RAFT](raft.png)
 
 
 ## Reference
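As a rough illustration of how the reward model described in the updated intro might be queried, here is a minimal sketch. It assumes the checkpoint loads as a single-logit sequence-classification head through `transformers`; the model id, prompt formatting, and loading details are placeholders, not anything stated in this commit.

```python
# Minimal sketch, not the repo's documented usage: assumes the reward model
# loads as a one-logit sequence-classification head; the model id and the
# prompt formatting below are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "path/to/this-reward-model"  # replace with this repo's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16
)
reward_model.eval()

def reward(prompt: str, response: str) -> float:
    """Return the scalar reward assigned to a (prompt, response) pair."""
    inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # The single logit is read off as the reward score.
        return reward_model(**inputs).logits[0, 0].item()

prompt = "Human: How do I bake bread at home? Assistant: "
print(reward(prompt, "Mix flour, water, yeast and salt, knead, let it rise, then bake."))
print(reward(prompt, "I have no idea."))
```

A higher score should indicate a response the model rates as better; the exact prompt template used during training is not shown in this commit, so check the full README before relying on a particular format.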
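The RAFT loop added in the second hunk is described only in prose. Below is a minimal sketch of one iteration under the stated settings (2048 prompts per iteration, K=8 samples per prompt, keep the best-of-K response by reward, then finetune on the kept pairs). The `generate`, `reward`, and `finetune` callables are hypothetical stand-ins, not LMFlow or RAFT APIs.

```python
# Sketch of one RAFT (reward-ranked finetuning) iteration as described above.
# The generate/reward/finetune callables are hypothetical stand-ins for the
# actual training stack, not a real LMFlow API.
import random
from typing import Callable, List, Tuple

def raft_iteration(
    prompt_pool: List[str],
    generate: Callable[[str, int], List[str]],          # prompt, K -> K sampled responses
    reward: Callable[[str, str], float],                # (prompt, response) -> scalar reward
    finetune: Callable[[List[Tuple[str, str]]], None],  # supervised finetuning on selected pairs
    num_prompts: int = 2048,
    k: int = 8,
) -> List[Tuple[str, str]]:
    """Run one reward-ranked finetuning step and return the selected (prompt, response) pairs."""
    prompts = random.sample(prompt_pool, num_prompts)
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        # Keep only the highest-reward response for this prompt.
        best = max(candidates, key=lambda response: reward(prompt, response))
        selected.append((prompt, best))
    # Finetuning on the best-of-K set yields the model for the next iteration.
    finetune(selected)
    return selected
```

Repeating this step and tracking the mean reward of the selected set per iteration would give a learning curve of the kind shown in raft.png.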