---
license: other
base_model: google/gemma-7b
tags:
- alignment-handbook
- trl
- sft
- generated_from_trainer
datasets:
- allenai/ultrafeedback_binarized_cleaned
model-index:
- name: MMPO_Gemma_7b_gamma1.1_epoch3
results: []
---
# MMPO_Gemma_7b_gamma1.1_epoch3
This is the model checkpoint for the paper:
**Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback**
Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee
*In EMNLP 2024 Findings*
This model is a fine-tuned version of [kykim0/gemma-7b-ultrachat-sft](https://huggingface.co/kykim0/gemma-7b-ultrachat-sft) on the [allenai/ultrafeedback_binarized_cleaned](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset.
The model is optimized with MMPO (Margin Matching Preference Optimization), which incorporates per-feedback quality margins to enhance optimization.
Specifically, given quality margins in pairwise preferences, MMPO utilizes soft target probabilities based on the Bradley-Terry model.
You can find more details in the paper or the [official code](https://github.com/kykim0/margin-matching-pref-opt).
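As a rough sketch of the idea (not the exact formulation from the paper), the snippet below converts a per-pair quality margin into a Bradley-Terry soft target and plugs it into a DPO-style binary cross-entropy; the `gamma` scaling and the `beta`-scaled log-ratio logits are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def bradley_terry_target(chosen_score: torch.Tensor,
                         rejected_score: torch.Tensor,
                         gamma: float = 1.1) -> torch.Tensor:
    """Soft target probability that the chosen response beats the rejected one.

    Under the Bradley-Terry model, a score margin m = chosen - rejected maps to a
    win probability sigmoid(m). The gamma factor is an assumed scaling knob
    (mirroring the gamma in this checkpoint's name); see the paper for the exact form.
    """
    margin = chosen_score - rejected_score
    return torch.sigmoid(gamma * margin)

def mmpo_style_loss(policy_logratio_diff: torch.Tensor,
                    chosen_score: torch.Tensor,
                    rejected_score: torch.Tensor,
                    beta: float = 0.01,
                    gamma: float = 1.1) -> torch.Tensor:
    """Cross-entropy between the DPO-style implicit preference and the soft BT target.

    policy_logratio_diff: (log pi/pi_ref)(chosen) - (log pi/pi_ref)(rejected).
    With hard targets of 1.0 this reduces to the standard DPO loss.
    """
    target = bradley_terry_target(chosen_score, rejected_score, gamma)
    logits = beta * policy_logratio_diff
    return F.binary_cross_entropy_with_logits(logits, target)
```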
## Evaluation results
For MT-Bench, this model achieves a score of 7.53, higher than the 7.40 obtained when training with DPO.
For RewardBench, it achieves state-of-the-art performance among competing models at the same scale.
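To try the checkpoint, a minimal generation example with `transformers` is sketched below; the chat-template usage is an assumption based on the Gemma base model, and the placeholder should be replaced with this checkpoint's actual Hub repo id:

```python
# Minimal generation example; adjust dtype/device settings to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MMPO_Gemma_7b_gamma1.1_epoch3"  # placeholder: use the full Hub path of this checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```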
## Training and evaluation data
- Training: UltraFeedback
- Evaluation: MT-Bench, RewardBench
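A quick way to inspect the preference data is shown below; the split name is an assumption based on the common UltraFeedback binarized layout and may differ for this exact dataset:

```python
# Peek at the preference pairs used for training.
from datasets import load_dataset

ds = load_dataset("allenai/ultrafeedback_binarized_cleaned", split="train_prefs")  # split name is an assumption
print(ds[0].keys())  # expect fields such as "chosen", "rejected", and per-response scores
```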
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- total_eval_batch_size: 64
- optimizer: AdamW
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.3
- mixed_precision: bfloat16
- num_epochs: 3
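For reference, a rough mapping of these hyperparameters onto a TRL-style config is sketched below; the official code builds on the alignment-handbook/TRL stack, but the exact config class and field names used here are assumptions:

```python
# Illustrative only: the trainer/config classes in the official MMPO code may differ.
from trl import DPOConfig  # DPOConfig extends transformers.TrainingArguments

training_args = DPOConfig(
    output_dir="mmpo-gemma-7b-ultrafeedback",  # hypothetical output path
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,  # 4 GPUs x 1 per-device x 16 accumulation = 64 effective batch size
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.3,
    bf16=True,
    optim="adamw_torch",
    seed=42,
)
```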