weqweasdas committed on
Commit
7ecd22c
·
1 Parent(s): dcd09c3

Create README.md

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ {}
+ ---
+
+ # Reward model for HH-RLHF
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ This repository presents a reward model trained with the [LMFlow](https://github.com/OptimalScale/LMFlow) framework. The reward model is for the [HH-RLHF dataset](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) and is trained from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
+
+ ## Model Details
+
+ ### Dataset preprocessing
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ The HH-RLHF dataset contains 112K comparison samples in the training set and 12.5K comparison samples in the test set. We first replace `\n\nHuman` and `\n\nAssistant` in the dataset with `###Human` and `###Assistant`, respectively.
+
+ Then, we split the dataset as follows:
+
+ - SFT dataset: the 112K training samples plus the first 6275 samples of the test set, using only the chosen responses;
+ - Training set for reward modeling: the 112K training samples plus the first 6275 samples of the test set, using both the chosen and rejected responses;
+ - Test set for reward modeling: the last 6226 samples of the original test set.
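The preprocessing and split described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names and the toy record with `prompt`, `chosen`, and `rejected` fields are assumptions made for the example, and only the marker strings and the 6275/6226 split sizes come from the text.

```python
def normalize_markers(text: str) -> str:
    """Replace the original HH-RLHF turn markers with the '###' style."""
    return text.replace("\n\nHuman", "###Human").replace("\n\nAssistant", "###Assistant")


def split_test_set(test_samples, n_head=6275):
    """Return (head, tail): head is merged into the training data, tail is held out."""
    return test_samples[:n_head], test_samples[n_head:]


# Toy record (hypothetical field names and content, for illustration only).
sample = {"prompt": "\n\nHuman: Hi\n\nAssistant:", "chosen": " Hello!", "rejected": " Go away."}
normalized_prompt = normalize_markers(sample["prompt"])

# Stand-in for the 12.5K-sample test set: 6275 + 6226 = 12501 items.
head, tail = split_test_set(list(range(12501)))
```

With the stand-in list, `head` has 6275 items and `tail` has the remaining 6226, matching the sizes quoted in the list above.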
+
+ ### Training
+
+ To use the data more efficiently, we concatenate the texts with an EOS token between them and split the result into chunks of 1024 tokens, rather than padding each text to the length of the longest one in its batch. We then finetune the base model on the SFT dataset for two epochs, using a learning rate of 2e-5 and a linear decay schedule.
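The packing step can be pictured with the sketch below. It is only an illustration under stated assumptions: `pack_sequences` is a hypothetical helper, the inputs are already-tokenized ID lists, an EOS ID is appended after every text, and the incomplete final chunk is dropped (the text above does not say how the remainder is handled).

```python
def pack_sequences(tokenized_texts, eos_id, block_size=1024):
    """Concatenate token ID lists, separated by EOS, and cut fixed-size chunks.

    Assumption: the trailing partial chunk is discarded.
    """
    stream = []
    for ids in tokenized_texts:
        stream.extend(ids)
        stream.append(eos_id)  # EOS marks the boundary between texts
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Tiny example with block_size=4 so the result is easy to read.
chunks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, block_size=4)
```

Because every chunk is full, no compute is spent on padding tokens, which is the efficiency gain the paragraph refers to.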
+
+ We conduct reward modeling for 1 epoch with a learning rate of 5e-6 and a linear decay schedule, because the model appears to overfit easily when trained for more than 1 epoch. We discard samples longer than 512 tokens, which leaves approximately 10.6K samples in the training set and 5K samples in the test set for reward modeling.
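The text does not state the training objective. A common choice for comparison data is the pairwise (Bradley-Terry) loss, `-log(sigmoid(r_chosen - r_rejected))`, and the sketch below assumes that objective purely for illustration. `pairwise_loss` is a hypothetical name, not an LMFlow function.

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


loss_tie = pairwise_loss(1.0, 1.0)    # equal rewards give log(2)
loss_good = pairwise_loss(3.0, 0.0)   # chosen scored higher: small loss
loss_bad = pairwise_loss(0.0, 3.0)    # chosen scored lower: large loss
```

Under this assumed objective, a model that cannot tell the two responses apart has a loss of log(2), about 0.693, which gives a reference point for reading an evaluation loss.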
+
+ We use bf16 and do not use LoRA in either stage.
+
+ **The resulting model achieves an evaluation loss of 0.5 and an evaluation accuracy of 75.48%.**
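How the accuracy is computed is not spelled out here. A usual definition for reward models, assumed in this sketch, is the fraction of test pairs in which the chosen response receives a strictly higher reward than the rejected one. The helper name and the toy scores are made up for the example.

```python
def pairwise_accuracy(reward_pairs):
    """Fraction of (chosen, rejected) reward pairs with chosen > rejected.

    Assumption: ties count as incorrect.
    """
    pairs = list(reward_pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Toy scores, not real model outputs.
toy_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
toy_accuracy = pairwise_accuracy(toy_pairs)
```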
+
+
+
+ ## Uses
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, pipeline
+
+ rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/hh_rlhf_rm_open_llama_3b")
+
+ rm_pipe = pipeline(
+     "sentiment-analysis",
+     model="weqweasdas/hh_rlhf_rm_open_llama_3b",
+     # `device` needs a concrete device; "auto" is not a valid value for it.
+     device=0 if torch.cuda.is_available() else -1,
+     tokenizer=rm_tokenizer,
+     model_kwargs={"torch_dtype": torch.bfloat16},
+ )
+
+ # Return the raw scores, with no sigmoid or softmax applied.
+ pipe_kwargs = {
+     "return_all_scores": True,
+     "function_to_apply": "none",
+     "batch_size": 1,
+ }
+
+ test_texts = [
+     "###Human: My daughter wants to know how to convert fractions to decimals, but I'm not sure how to explain it. Can you help? ###Assistant: Sure. So one way of converting fractions to decimals is to ask “how many halves are there?” and then write this as a decimal number. But that's a little tricky. Here's a simpler way: if a fraction is expressed as a/b, then it's decimal equivalent is just a/b * 1.0 So, for example, the decimal equivalent of 1/2 is 1/2 * 1.0 = 0.5.",
+     "###Human: I have fresh whole chicken in my fridge. What dish can I prepare using it that will take me less than an hour to cook? ###Assistant: Are you interested in a quick and easy recipe you can prepare with chicken you have on hand, or something more involved? In terms of both effort and time, what are you looking for?",
+ ]
+
+ pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
+ # Take the first score of each output as the reward for that text.
+ rewards = [output[0]["score"] for output in pipe_outputs]
+ ```
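To score your own conversations, the input has to follow the same `###Human: ... ###Assistant: ...` layout as the test texts above. The helper below is a hypothetical convenience, not part of the released code: it assumes turns are joined by a single space, which is inferred only from the two example strings.

```python
def format_dialogue(turns):
    """Build an input string from (role, text) turns.

    Assumption: turns are joined with one space, as in the example texts.
    """
    parts = []
    for role, text in turns:
        if role not in ("Human", "Assistant"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"###{role}: {text.strip()}")
    return " ".join(parts)


formatted = format_dialogue([("Human", "Hi there."), ("Assistant", "Hello! How can I help?")])
```

The resulting string can be passed to the pipeline in place of one of the entries in `test_texts`.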
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+
+
+ ## Reference
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ If you find this model useful, please cite our framework and paper using the following BibTeX:
+
+
+ ```
+ @article{diao2023lmflow,
+   title={{LMFlow}: An extensible toolkit for finetuning and inference of large foundation models},
+   author={Diao, Shizhe and Pan, Rui and Dong, Hanze and Shum, Ka Shun and Zhang, Jipeng and Xiong, Wei and Zhang, Tong},
+   journal={arXiv preprint arXiv:2306.12420},
+   year={2023}
+ }
+ ```
+ ```
+ @article{dong2023raft,
+   title={{RAFT}: Reward ranked finetuning for generative foundation model alignment},
+   author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
+   journal={arXiv preprint arXiv:2304.06767},
+   year={2023}
+ }
+ ```
+
+
+