---
license: other
base_model: google/gemma-7b
tags:
- alignment-handbook
- trl
- sft
- generated_from_trainer
datasets:
- allenai/ultrafeedback_binarized_cleaned
model-index:
- name: MMPO_Gemma_7b_gamma1.1_epoch3
  results: []
---

# MMPO_Gemma_7b_gamma1.1_epoch3

This is the model checkpoint for the paper: **Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback**
Kyuyoung Kim*, Ah Jeong Seo*, Hao Liu, Jinwoo Shin, Kimin Lee
*In EMNLP 2024 Findings*

This model is a fine-tuned version of [kykim0/gemma-7b-ultrachat-sft](https://huggingface.co/kykim0/gemma-7b-ultrachat-sft) on the [allenai/ultrafeedback_binarized_cleaned](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset. The model is optimized with MMPO (Margin Matching Preference Optimization), which integrates per-feedback quality margins into preference optimization. Specifically, given quality margins in pairwise preferences, MMPO uses soft target probabilities derived from the Bradley-Terry model instead of hard binary targets (an illustrative sketch is included at the end of this card). You can find more details in the paper or in the [official code](https://github.com/kykim0/margin-matching-pref-opt).

## Evaluation results

On MT-Bench, this model achieves a score of 7.53, higher than the 7.40 obtained when the same base model is trained with DPO. On RewardBench, it achieves state-of-the-art performance among competing models at the same scale.

## Training and evaluation data

- Training: UltraFeedback
- Evaluation: MT-Bench, RewardBench

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- total_eval_batch_size: 64
- optimizer: AdamW
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.3
- mixed_precision: bfloat16
- num_epochs: 3
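
## Illustrative loss sketch

The snippet below is a minimal, unofficial sketch of the soft-target idea described above: the per-example quality margin (for example, the difference in fine-grained UltraFeedback ratings between the chosen and rejected responses) is mapped to a Bradley-Terry target probability, which replaces DPO's hard target of 1. The function name `mmpo_style_loss`, the way `gamma` scales the margin, and the default hyperparameter values are assumptions made for illustration; refer to the paper and the [official code](https://github.com/kykim0/margin-matching-pref-opt) for the exact formulation.

```python
import torch
import torch.nn.functional as F


def mmpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    margins, beta=0.01, gamma=1.1):
    """Illustrative soft-target preference loss (not the official MMPO code).

    margins: per-example quality margins between the chosen and rejected
    responses, e.g., differences of fine-grained feedback scores.
    """
    # Implicit reward difference under the DPO parameterization.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Bradley-Terry soft target derived from the granular feedback margin;
    # the gamma scaling here is an assumption for illustration.
    targets = torch.sigmoid(gamma * margins)
    # Cross-entropy against the soft target; with targets fixed at 1.0
    # this reduces to the standard DPO loss.
    return F.binary_cross_entropy_with_logits(logits, targets)
```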
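
## How to use

Below is a minimal generation sketch assuming this checkpoint can be loaded as a standard `transformers` causal LM, as its Gemma base can. The `model_id` string is a placeholder rather than the actual repository path of this checkpoint; replace it with the correct repo id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the actual Hugging Face repo id of this checkpoint.
model_id = "MMPO_Gemma_7b_gamma1.1_epoch3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```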