Edit model card

This is a model released for our paper: REBEL: Reinforcement Learning via Regressing Relative Rewards.

REBEL-Llama-3-epoch_2

This model is developed with REBEL based on Meta-Llama-3-8B-Instruct with FsfairX-LLaMA3-RM-v0.1 as the reward model and UltraFeedback dataset. The training code is available at https://github.com/ZhaolinGao/REBEL. We collect online generations during each iteration with a batch size of 32.

Links to Other Model

REBEL-OpenChat-3.5

REBEL-Llama-3

REBEL-Llama-3-Armo-iter_1

REBEL-Llama-3-Armo-iter_2

REBEL-Llama-3-Armo-iter_3

Evaluations

Model AlpacaEval 2.0
LC Win Rate
AlpacaEval 2.0
Win Rate
MT-Bench
Average
MMLU
(5-shot)
GSM8K
(5-shot)
REBEL-OpenChat-3.5 17.3 12.8 8.06 63.7 68.8
REBEL-Llama-3 30.1 32.6 8.16 65.8 75.6
REBEL-Llama-3-epoch_2 31.3 34.2 7.83 65.4 75.4
REBEL-Llama-3-Armo-iter_1 48.3 41.8 8.13 66.3 75.8
REBEL-Llama-3-Armo-iter_2 50.0 48.5 8.07 65.9 75.4
REBEL-Llama-3-Armo-iter_3 49.7 48.1 8.01 66.0 75.7

Citation

Please cite our paper if you use this model in your own work:

@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards}, 
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Downloads last month
14
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train Cornell-AGI/REBEL-Llama-3-epoch_2

Collection including Cornell-AGI/REBEL-Llama-3-epoch_2