---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Reward model for HH-RLHF

<!-- Provide a quick summary of what the model is/does. -->

In this repo, we present a reward model trained with the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model is trained for the [HH-RLHF dataset](https://huggingface.co/datasets/Dahoas/full-hh-rlhf) from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).

## Model Details

### Dataset preprocessing

<!-- Provide a longer summary of what this model is. -->

The HH-RLHF dataset contains 112K comparison samples in the training set and 12.5K comparison samples in the test set. We first replace the `\n\nHuman` and `\n\nAssistant` markers in the dataset with `###Human` and `###Assistant`, respectively.

Then, we split the dataset as follows (a preprocessing sketch is given after the list):

- SFT dataset: the 112K training samples plus the first 6275 samples of the test set, using only the chosen responses;
- Reward modeling training set: the 112K training samples plus the first 6275 samples of the test set, using both the chosen and rejected responses;
- Reward modeling test set: the last 6226 samples of the original test set.

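The exact preprocessing code lives in the LMFlow repository and is not restated here; the snippet below is only a minimal sketch of the split described above. It assumes that `Dahoas/full-hh-rlhf` exposes `prompt`, `chosen`, and `rejected` text fields, which is an assumption rather than a quote of the original script.

```python
# Minimal sketch of the preprocessing described above, NOT the original LMFlow script.
# Assumes Dahoas/full-hh-rlhf exposes `prompt`, `chosen`, and `rejected` text fields.
from datasets import load_dataset

def rename_roles(text: str) -> str:
    # Swap the original role markers for the ###-style markers used by this model.
    return text.replace("\n\nHuman", "###Human").replace("\n\nAssistant", "###Assistant")

raw = load_dataset("Dahoas/full-hh-rlhf")
train, test = raw["train"], raw["test"]

# 112K training samples + the first 6275 test samples form the SFT / reward-model training pool.
pool = list(train) + list(test.select(range(6275)))
# The last 6226 test samples are held out as the reward-model test set.
rm_test = list(test.select(range(6275, len(test))))

# SFT uses only the chosen responses; reward modeling uses the chosen/rejected pair.
sft_texts = [rename_roles(x["prompt"] + x["chosen"]) for x in pool]
rm_train = [
    {"chosen": rename_roles(x["prompt"] + x["chosen"]),
     "rejected": rename_roles(x["prompt"] + x["rejected"])}
    for x in pool
]
```
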
### Training

To use the data more efficiently, we concatenate texts with an EOS token in between and split the result into 1024-token chunks, rather than padding each batch to its longest text. We then fine-tune the base model on the SFT dataset for two epochs, using a learning rate of 2e-5 and a linear decay schedule.

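The packing step is not shown in this card; the following is a small sketch of one way to implement it, assuming the base model's tokenizer and a block size of 1024 tokens.

```python
# Sketch of EOS-joined packing into fixed 1024-token blocks (assumed implementation, not LMFlow's code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
BLOCK_SIZE = 1024

def pack(texts):
    # Concatenate all token ids, inserting an EOS token between texts,
    # then cut the stream into BLOCK_SIZE-sized chunks (dropping the remainder).
    ids = []
    for text in texts:
        ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
        ids.append(tokenizer.eos_token_id)
    return [ids[i : i + BLOCK_SIZE] for i in range(0, len(ids) - BLOCK_SIZE + 1, BLOCK_SIZE)]
```

Packing this way avoids spending compute on padding tokens when sample lengths vary widely.
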
For reward modeling, we train for one epoch with a learning rate of 5e-6 and a linear decay schedule, since the model appears to overfit easily when trained for more than one epoch. We discard samples longer than 512 tokens, which leaves approximately 10.6K samples in the training set and 5K samples in the test set for reward modeling.

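The card does not spell out the reward-modeling objective. A common choice for comparison data, and the one we assume here purely for illustration, is the pairwise ranking loss `-log sigmoid(r_chosen - r_rejected)`, where `r` is the scalar output of the reward model for a response; under such an objective, the evaluation accuracy reported below would be the fraction of test pairs where the chosen response scores higher.

```python
# Illustrative pairwise ranking loss for comparison data (an assumption, not the exact LMFlow objective).
import torch
import torch.nn.functional as F

def pairwise_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Encourage the scalar reward of the chosen response to exceed that of the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```
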
We use bf16 and do not use LoRA in either stage.

**The resulting model achieves an evaluation loss of 0.5 and an evaluation accuracy of 75.48%.**

## Uses

```python
import torch
from transformers import AutoTokenizer, pipeline

rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/hh_rlhf_rm_open_llama_3b")

rm_pipe = pipeline(
    "sentiment-analysis",
    model="weqweasdas/hh_rlhf_rm_open_llama_3b",
    device="auto",  # depending on your transformers version, device_map="auto" may be required instead
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 1,
}

test_texts = [
    "###Human: My daughter wants to know how to convert fractions to decimals, but I'm not sure how to explain it. Can you help? ###Assistant: Sure. So one way of converting fractions to decimals is to ask “how many halves are there?” and then write this as a decimal number. But that's a little tricky. Here's a simpler way: if a fraction is expressed as a/b, then it's decimal equivalent is just a/b * 1.0 So, for example, the decimal equivalent of 1/2 is 1/2 * 1.0 = 0.5.",
    "###Human: I have fresh whole chicken in my fridge. What dish can I prepare using it that will take me less than an hour to cook? ###Assistant: Are you interested in a quick and easy recipe you can prepare with chicken you have on hand, or something more involved? In terms of both effort and time, what are you looking for?",
]

# Each output is a list with one {"label", "score"} dict; the score is the raw reward value.
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
```
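
Since `function_to_apply` is set to `"none"`, the pipeline returns the raw scalar output of the reward head rather than a softmax probability, so a higher score means the model prefers that response. Inputs should follow the `###Human:` / `###Assistant:` format produced by the preprocessing described above.
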
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Reference

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

If you find this model useful, please cite our framework and paper using the following BibTeX:

```
@article{diao2023lmflow,
  title={LMFlow: An extensible toolkit for finetuning and inference of large foundation models},
  author={Diao, Shizhe and Pan, Rui and Dong, Hanze and Shum, Ka Shun and Zhang, Jipeng and Xiong, Wei and Zhang, Tong},
  journal={arXiv preprint arXiv:2306.12420},
  year={2023}
}
```

```
@article{dong2023raft,
  title={RAFT: Reward ranked finetuning for generative foundation model alignment},
  author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  journal={arXiv preprint arXiv:2304.06767},
  year={2023}
}
```