File size: 3,234 Bytes
d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 f6fae48 d276df6 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
---
language:
- en
license: apache-2.0
datasets:
- openbmb/UltraFeedback
pipeline_tag: text-generation
model-index:
- name: GPO-Llama-3-8B-Instruct-GPM-2B
results: []
---
General Preference Modeling with Preference Representations for Aligning Language Models (https://arxiv.org/abs/2410.02197)
# GPO-Llama-3-8B-Instruct-GPM-2B
This model was developed using [General Preference Optimization (GPO)](https://arxiv.org/abs/2405.00675) at iteration 3 and the [General Preference representation Model (GPM)](https://arxiv.org/abs/2410.02197) (specifically, using [GPM-Gemma-2B](https://huggingface.co/general-preference/GPM-Gemma-2B)), based on the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) architecture as starting point. We utilized the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, splited to 3 parts for 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.
## Links to Other Models
- [SPPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/SPPO-Llama-3-8B-Instruct-GPM-2B)
- [GPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/GPO-Llama-3-8B-Instruct-GPM-2B)
### Model Description
- Model type: A 8B parameter GPT-like model fine-tuned on synthetic datasets.
- Language(s) (NLP): Primarily English
- License: Apache-2.0
- Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct
## [AlpacaEval Leaderboard Evaluation Results](https://tatsu-lab.github.io/alpaca_eval/)
| Model | LC. Win Rate | Win Rate | Avg. Length |
|-------------------------------------------|:------------:|:--------:|:-----------:|
|[GPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/GPO-Llama-3-8B-Instruct-GPM-2B) | 38.43 | 48.87 | 2613
## [Open LLM Leaderboard Evaluation Results](https://github.com/EleutherAI/lm-evaluation-harness)
Results are reported by using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.1
| | arc_challenge | truthfulqa_mc2 | winogrande | gsm8k | hellaswag | mmlu | average |
|--------|---------------|----------------|------------|-------|-----------|-------|---------|
|[GPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/GPO-Llama-3-8B-Instruct-GPM-2B) | 61.43 | 53.54 | 75.22 | 76.12 | 78.06 | 65.65 | 68.34
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- beta: 0.001
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 6.0 (stop at epoch=1.0)
## Citation
```
@article{zhang2024general,
title={General Preference Modeling with Preference Representations for Aligning Language Models},
author={Zhang, Yifan and Zhang, Ge and Wu, Yue and Xu, Kangping and Gu, Quanquan},
journal={arXiv preprint arXiv:2410.02197},
year={2024}
}
```
|