license: mit

News

[2023/09/26]: UltraRM unleashes the power of UltraLM-13B-v2.0 and UltraLM-13B! Simple best-of-16 sampling achieves 92.30% (UltraLM2, 🥇 in 13B results) and 91.54% (UltraLM, 🥇 in LLaMA-1 results) win rates against text-davinci-003 on the AlpacaEval benchmark!
[2023/09/26]: We release the UltraFeedback dataset, along with the UltraFeedback-powered reward model UltraRM and critique model UltraCM! Both set new SOTAs among open-source models!
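
For readers unfamiliar with best-of-n sampling, the idea is to draw n candidate responses from the policy model and keep the one the reward model scores highest. The following is a minimal sketch, not the code used for the numbers above: `generate` and `reward` are hypothetical callables standing in for a UltraLM sampling call and a UltraRM scoring call.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one response from the policy
    reward: Callable[[str, str], float],  # hypothetical: scores (prompt, response) with the reward model
    n: int = 16,
) -> Tuple[str, float]:
    """Sample n responses and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the policy.
    candidates = [generate(prompt) for _ in range(n)]
    # Score every candidate with the reward model.
    scores = [reward(prompt, c) for c in candidates]
    # Keep the candidate with the highest reward (first one wins on ties).
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

With n=16 this corresponds to the best-of-16 setting mentioned above; the quality of the selected response depends entirely on how well the reward model ranks candidates.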

UltraRM

We train and release UltraRM, a reward model based on UltraFeedback, to further facilitate alignment research. UltraRM is initialized from LLaMA2-13B.

Specifically, we train two versions of the reward model: UltraRM-UF is fine-tuned only on UltraFeedback, while UltraRM is fine-tuned on a mixture of UltraFeedback and an equal-size sample from three open-source datasets: Anthropic HH-RLHF, Stanford SHP, and Summarization.
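
The mixture described above could be assembled roughly as follows. This is only a sketch under stated assumptions: it reads "equal-size sample" as a sample whose total size matches UltraFeedback, drawn uniformly from the pooled three datasets; the actual per-dataset proportions are not specified here. The function name `build_mixture` and the in-memory list representation are hypothetical.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def build_mixture(
    ultrafeedback: Sequence[T],
    other_sets: Dict[str, Sequence[T]],
    seed: int = 0,
) -> List[T]:
    """Combine UltraFeedback with an equal-size sample from the other datasets."""
    rng = random.Random(seed)
    # Pool the other datasets (assumption: uniform sampling over the pool).
    pool = [ex for name in sorted(other_sets) for ex in other_sets[name]]
    # Match the UltraFeedback size, capped by what the pool can provide.
    k = min(len(ultrafeedback), len(pool))
    mixture = list(ultrafeedback) + rng.sample(pool, k)
    # Shuffle so the two sources are interleaved during training.
    rng.shuffle(mixture)
    return mixture
```

Under this reading, UltraRM-UF corresponds to training on `ultrafeedback` alone, and UltraRM to training on the returned mixture.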

Reward Modeling

On four public preference test sets, UltraRM achieves SOTA performance among open-source reward models.
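
Reward models are commonly compared on preference test sets by pairwise accuracy: the fraction of pairs where the human-preferred response receives the higher reward. The sketch below assumes that metric (the text above does not name it) and uses a hypothetical `reward` callable in place of an actual UltraRM scoring call.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str, str]],  # (prompt, chosen, rejected)
    reward: Callable[[str, str], float],    # hypothetical: scores (prompt, response)
) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    total = 0
    correct = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        # A pair counts as correct only if the chosen response scores strictly higher.
        if reward(prompt, chosen) > reward(prompt, rejected):
            correct += 1
    if total == 0:
        raise ValueError("pairs must not be empty")
    return correct / total
```

Ties are counted as errors here; other conventions (for example, counting a tie as half correct) are possible and would change the reported number slightly.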
