hendrydong committed
Commit 4871bd2 (1 parent: cdaa737)

Update README.md

Files changed (1)
  1. README.md +20 -6
README.md CHANGED
@@ -10,12 +10,9 @@ and strong proprietary models (e.g., GPT-3.5-turbo-0613). The model is trained w
 
 ## Model Releases
 - [SFT model](https://huggingface.co/Salesforce/SFR-SFT-LLaMA-3-8B-R)
-- [Reward model](https://huggingface.co/Salesforce)
+- [Reward model](https://huggingface.co/Salesforce/SFR-RM-LLaMA-3-8B-R)
 - [RLHF model](https://huggingface.co/Salesforce/SFR-Iterative-DPO-LLaMA-3-8B-R)
 
-## Dataset Releases
-- [Preference data mix]()
-- [Prompt collection for RLHF training]()
 
 ## Training methods
 We have developed a simple and efficient online RLHF recipe for LLM instruct training. Our recipe is DPO-based and thus much cheaper and simpler to train and tune compared to PPO-based approaches.
@@ -95,6 +92,23 @@ We are committed to continuous improvement in our models to minimize such risks
 
 ## Citation
 Please cite our technical report if you find our model useful for your research or product.
-```
-@article{}
+
+```bibtex
+@misc{dong2024rlhf,
+      title={RLHF Workflow: From Reward Modeling to Online RLHF},
+      author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
+      year={2024},
+      eprint={2405.07863},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
+
+@misc{xiong2024iterative,
+      title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
+      author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
+      year={2024},
+      eprint={2312.11456},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
 ```
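
Below the diff, a minimal usage sketch for the RLHF checkpoint linked in the Model Releases list. This is a sketch only: it assumes the checkpoint ships a standard Llama-3 chat template with its tokenizer and that stock `transformers` generation settings are acceptable; neither is stated in this commit.

```python
# Hypothetical usage sketch (not from the model card): load the released
# RLHF checkpoint and run one chat turn via the tokenizer's chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/SFR-Iterative-DPO-LLaMA-3-8B-R"  # model listed in the README
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize DPO in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```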
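
The Training methods paragraph touched by this diff describes the recipe only at a high level (online and DPO-based rather than PPO-based). As a rough reference point, here is the standard DPO loss that such a recipe applies at each iteration to newly collected preference pairs; this is a generic sketch based on the formulation in the cited reports, not the released training code.

```python
# Generic DPO loss sketch (assumed standard formulation, not the authors' code):
# given sequence-level log-probabilities of the chosen/rejected responses under
# the current policy and a frozen reference model, widen the preference margin.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of policy to reference on the preferred and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): a larger margin for the preferred response lowers the loss.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In the online variant described in the cited reports, each iteration samples fresh responses from the current policy, ranks them with the reward model to form (chosen, rejected) pairs, and minimizes this loss again; that is the basis of the README's claim that the recipe is cheaper and simpler to tune than PPO-based training.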