---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Reward model for HH-RLHF

<!-- Provide a quick summary of what the model is/does. -->

In this repo, we present a reward model trained with the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model is trained for the [HH-RLHF dataset](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).

## Model Details

### Dataset preprocessing

<!-- Provide a longer summary of what this model is. -->

The HH-RLHF dataset contains 112K comparison samples in the training set and 12.5K comparison samples in the test set. We first replace the `\n\nHuman` and `\n\nAssistant` markers in the dataset with `###Human` and `###Assistant`, respectively.

Then, we split the dataset as follows (a preprocessing sketch is given after the list):

- SFT dataset: the 112K training samples plus the first 6275 samples of the test set, using only the chosen responses;
- Reward modeling training set: the 112K training samples plus the first 6275 samples of the test set, using both the chosen and rejected responses;
- Reward modeling test set: the last 6226 samples of the original test set.

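The exact preprocessing code lives in the LMFlow repository and is not restated here; the snippet below is only a minimal sketch of the split described above. It assumes that `Dahoas/full-hh-rlhf` exposes `prompt`, `chosen`, and `rejected` text fields, which is an assumption rather than a quote of the original script.

```python
# Minimal sketch of the preprocessing described above, NOT the original LMFlow script.
# Assumes Dahoas/full-hh-rlhf exposes `prompt`, `chosen`, and `rejected` text fields.
from datasets import load_dataset

def rename_roles(text: str) -> str:
    # Swap the original role markers for the ###-style markers used by this model.
    return text.replace("\n\nHuman", "###Human").replace("\n\nAssistant", "###Assistant")

raw = load_dataset("Dahoas/full-hh-rlhf")
train, test = raw["train"], raw["test"]

# 112K training samples + the first 6275 test samples form the SFT / reward-model training pool.
pool = list(train) + list(test.select(range(6275)))
# The last 6226 test samples are held out as the reward-model test set.
rm_test = list(test.select(range(6275, len(test))))

# SFT uses only the chosen responses; reward modeling uses the chosen/rejected pair.
sft_texts = [rename_roles(x["prompt"] + x["chosen"]) for x in pool]
rm_train = [
    {"chosen": rename_roles(x["prompt"] + x["chosen"]),
     "rejected": rename_roles(x["prompt"] + x["rejected"])}
    for x in pool
]
```
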
### Training

To use the data more efficiently, we concatenate texts with an EOS token in between and split the result into 1024-token chunks, rather than padding each batch to its longest text. We then fine-tune the base model on the SFT dataset for two epochs, using a learning rate of 2e-5 and a linear decay schedule.

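The packing step is not shown in this card; the following is a small sketch of one way to implement it, assuming the base model's tokenizer and a block size of 1024 tokens.

```python
# Sketch of EOS-joined packing into fixed 1024-token blocks (assumed implementation, not LMFlow's code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
BLOCK_SIZE = 1024

def pack(texts):
    # Concatenate all token ids, inserting an EOS token between texts,
    # then cut the stream into BLOCK_SIZE-sized chunks (dropping the remainder).
    ids = []
    for text in texts:
        ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
        ids.append(tokenizer.eos_token_id)
    return [ids[i : i + BLOCK_SIZE] for i in range(0, len(ids) - BLOCK_SIZE + 1, BLOCK_SIZE)]
```

Packing this way avoids spending compute on padding tokens when sample lengths vary widely.
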
For reward modeling, we train for one epoch with a learning rate of 5e-6 and a linear decay schedule, since the model appears to overfit easily when trained for more than one epoch. We discard samples longer than 512 tokens, which leaves approximately 10.6K samples in the training set and 5K samples in the test set for reward modeling.

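The card does not spell out the reward-modeling objective. A common choice for comparison data, and the one we assume here purely for illustration, is the pairwise ranking loss `-log sigmoid(r_chosen - r_rejected)`, where `r` is the scalar output of the reward model for a response; under such an objective, the evaluation accuracy reported below would be the fraction of test pairs where the chosen response scores higher.

```python
# Illustrative pairwise ranking loss for comparison data (an assumption, not the exact LMFlow objective).
import torch
import torch.nn.functional as F

def pairwise_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Encourage the scalar reward of the chosen response to exceed that of the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```
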
We use bf16 and do not use LoRA in either stage.

**The resulting model achieves an evaluation loss of 0.5 and an evaluation accuracy of 75.48%.**

## Uses

```python
import torch
from transformers import AutoTokenizer, pipeline

rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/hh_rlhf_rm_open_llama_3b")

rm_pipe = pipeline(
    "sentiment-analysis",
    model="weqweasdas/hh_rlhf_rm_open_llama_3b",
    device="auto",  # depending on your transformers version, device_map="auto" may be required instead
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 1,
}

test_texts = [
    "###Human: My daughter wants to know how to convert fractions to decimals, but I'm not sure how to explain it. Can you help? ###Assistant: Sure. So one way of converting fractions to decimals is to ask “how many halves are there?” and then write this as a decimal number. But that's a little tricky. Here's a simpler way: if a fraction is expressed as a/b, then it's decimal equivalent is just a/b * 1.0 So, for example, the decimal equivalent of 1/2 is 1/2 * 1.0 = 0.5.",
    "###Human: I have fresh whole chicken in my fridge. What dish can I prepare using it that will take me less than an hour to cook? ###Assistant: Are you interested in a quick and easy recipe you can prepare with chicken you have on hand, or something more involved? In terms of both effort and time, what are you looking for?",
]

# Each output is a list with one {"label", "score"} dict; the score is the raw reward value.
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
```
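
Since `function_to_apply` is set to `"none"`, the pipeline returns the raw scalar output of the reward head rather than a softmax probability, so a higher score means the model prefers that response. Inputs should follow the `###Human:` / `###Assistant:` format produced by the preprocessing described above.
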
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Reference

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

If you find this model useful, please cite our framework and paper using the following BibTeX:

```
@article{diao2023lmflow,
  title={LMFlow: An extensible toolkit for finetuning and inference of large foundation models},
  author={Diao, Shizhe and Pan, Rui and Dong, Hanze and Shum, Ka Shun and Zhang, Jipeng and Xiong, Wei and Zhang, Tong},
  journal={arXiv preprint arXiv:2306.12420},
  year={2023}
}
```

```
@article{dong2023raft,
  title={RAFT: Reward ranked finetuning for generative foundation model alignment},
  author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  journal={arXiv preprint arXiv:2304.06767},
  year={2023}
}
```