---
license: apache-2.0
datasets:
- openbmb/UltraFeedback
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---
This model was released as part of our paper: [REBEL: Reinforcement Learning via Regressing Relative Rewards](https://arxiv.org/abs/2404.16767).

# REBEL-Llama-3-Armo-iter_3

This model is trained with REBEL, starting from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), with [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model and the [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset.
The training code is available at https://github.com/ZhaolinGao/REBEL. We collect offline generations over the entire dataset, taking the best of 5 samples as the chosen response and the worst of 5 as the rejected response ([Ultrafeedback-Llama-3-Armo-iter_3](https://huggingface.co/datasets/Cornell-AGI/Ultrafeedback-Llama-3-Armo-iter_3)).
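
REBEL casts policy optimization as iterative least-squares regression: at each iteration it fits the difference of policy log-ratios to the difference of rewards between the chosen and rejected responses. Below is a sketch of the per-iteration objective as we read it from the paper, where π_θt is the current policy, r is the reward model, and η is a hyperparameter; see the paper for the exact formulation.

```latex
% Sketch of the REBEL per-iteration regression objective.
% Pairs (x, y, y') are the chosen/rejected responses collected above.
\theta_{t+1} = \arg\min_{\theta} \sum_{(x, y, y') \in \mathcal{D}_t}
\left(
  \frac{1}{\eta}\left(
    \ln \frac{\pi_\theta(y \mid x)}{\pi_{\theta_t}(y \mid x)}
    - \ln \frac{\pi_\theta(y' \mid x)}{\pi_{\theta_t}(y' \mid x)}
  \right)
  - \bigl( r(x, y) - r(x, y') \bigr)
\right)^2
```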

### Links to Other Models

[REBEL-OpenChat-3.5](https://huggingface.co/Cornell-AGI/REBEL-OpenChat-3.5)

[REBEL-Llama-3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3)

[REBEL-Llama-3-epoch_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-epoch_2)

[REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1)

[REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2)

### Evaluations

| Model | AlpacaEval 2.0<br>LC Win Rate | AlpacaEval 2.0<br>Win Rate | MT-Bench<br>Average | MMLU<br>(5-shot) | GSM8K<br>(5-shot) |
| :--------: | :--------: |   :--------: | :--------: |  :--------: | :--------: |
| REBEL-OpenChat-3.5| 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
| REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
| REBEL-Llama-3-epoch_2| 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
| REBEL-Llama-3-Armo-iter_1| 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
| REBEL-Llama-3-Armo-iter_2| 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
| REBEL-Llama-3-Armo-iter_3| 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |
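
### Usage

A minimal generation sketch using the standard `transformers` chat-template API; the prompt and generation settings here are illustrative, so adjust dtype and device placement for your hardware.

```python
# Minimal usage sketch (illustrative prompt and generation settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REBEL-Llama-3-Armo-iter_3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain RLHF in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Llama-3 Instruct models typically stop on <|eot_id|> in addition to the
# default EOS token.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
output = model.generate(input_ids, max_new_tokens=256, eos_token_id=terminators)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```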

## Citation
Please cite our paper if you use this model in your own work:
```
@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards}, 
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```