metadata
license: mit
News
- [2023/09/26]: UltraRM unleashes the power of UltraLM-13B-v2.0 and UltraLM-13B! Simple best-of-16 sampling achieves win rates of 92.30% (UltraLM2, 🥇 in 13B results) and 91.54% (UltraLM, 🥇 in LLaMA-1 results) against text-davinci-003 on the AlpacaEval benchmark!
- [2023/09/26]: We release the UltraFeedback dataset, along with the UltraFeedback-powered reward model UltraRM and critique model UltraCM! Both set new state-of-the-art results among open-source models!
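As a rough illustration of the best-of-n sampling mentioned in the news above, the sketch below picks the highest-scoring response out of a set of sampled candidates. The names `best_of_n` and `toy_reward` are hypothetical, and the length-based reward is only a stand-in so the example runs without model weights; in practice the scoring function would be a reward model such as UltraRM and the candidates would be 16 samples from the policy model.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that reward_fn scores highest for the prompt."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt: str, response: str) -> float:
    # Assumption for illustration only: longer responses get higher scores.
    # A real setup would call a trained reward model here.
    return float(len(response))


# Three candidates stand in for the n sampled responses.
samples = ["Hi.", "Hello there!", "Hello! How can I help you today?"]
best = best_of_n("Say hello.", samples, toy_reward)
```

Because selection only needs a scalar score per response, the same function works with any reward model that can be wrapped as `reward_fn(prompt, response)`.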
Links
- 🤗 UltraFeedback
- 🤗 UltraRM
- 🤗 UltraCM
UltraRM
We train and release a reward model, UltraRM, based on UltraFeedback to further facilitate alignment research. UltraRM is initialized from LLaMA2-13B.
Specifically, we train two versions of the reward model: UltraRM-UF is fine-tuned only on UltraFeedback, while UltraRM is fine-tuned on a mixture of UltraFeedback and an equal-size sample drawn from three open-source datasets: Anthropic HH-RLHF, Stanford SHP, and Summarization.
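The mixture described above could be assembled along the following lines. This is a minimal sketch under stated assumptions, not the actual data pipeline: it reads "equal-size sample" as a sample whose total size matches the primary dataset, split evenly across the other pools, and the function name `build_mixture`, the record layout, and the seed are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def build_mixture(
    primary: Sequence[dict],
    others: Dict[str, Sequence[dict]],
    seed: int = 0,
) -> List[dict]:
    """Combine the primary dataset with an equal-size sample from other pools."""
    mixture = list(primary)
    if not others:
        return mixture
    rng = random.Random(seed)
    # Assumption: the extra sample is split evenly across the other datasets.
    quota = len(primary) // len(others)
    for pool in others.values():
        k = min(quota, len(pool))  # never ask for more than the pool holds
        mixture.extend(rng.sample(list(pool), k))
    rng.shuffle(mixture)
    return mixture


# Tiny placeholder records; real entries would be preference pairs.
uf = [{"source": "uf", "id": i} for i in range(6)]
pools = {
    name: [{"source": name, "id": i} for i in range(10)]
    for name in ("hh", "shp", "summ")
}
mixed = build_mixture(uf, pools)
```

With six primary records and three pools, each pool contributes two records, so the mixture holds twelve records in total.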
Reward Modeling
On four public preference test sets, UltraRM achieves state-of-the-art results among open-source reward models.
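Reward models are commonly scored on preference test sets by pairwise accuracy: the fraction of pairs in which the preferred response receives the higher reward. The sketch below assumes that metric, which this document does not itself specify; `preference_accuracy` and `toy_reward` are hypothetical names, and the length-based reward is a placeholder for a real reward model.

```python
from typing import Callable, Iterable, Tuple

# Each pair is (prompt, chosen response, rejected response).
Pair = Tuple[str, str, str]


def preference_accuracy(
    pairs: Iterable[Pair],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    total = 0
    correct = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected):
            correct += 1
    if total == 0:
        raise ValueError("need at least one preference pair")
    return correct / total


def toy_reward(prompt: str, response: str) -> float:
    # Assumption for illustration only: score is the response length.
    return float(len(response))


test_pairs = [
    ("q1", "a detailed answer", "no"),  # toy reward agrees with the label
    ("q2", "ok", "a long but wrong reply"),  # toy reward disagrees
]
accuracy = preference_accuracy(test_pairs, toy_reward)
```

Ties count as errors in this sketch, since the comparison is strict.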