---
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/mistral-instruct-ultrafeedback
model-index:
- name: tpo-alignment/Mistral-Instruct-7B-TPO-y4
results: []
license: mit
---
# Mistral-Instruct-7B-TPO-y4 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).
## Model Details
### Model Description
We fine-tuned [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the fourth-best response as the preferred response, and the lowest-scoring response as the rejected response.
- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** MIT
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
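The TPO triples above come from ranking each prompt's candidate responses by reward score: the top-ranked candidate becomes the gold response, the fourth-ranked candidate the preferred response, and the bottom-ranked candidate the rejected response. A minimal sketch of that selection, assuming at least four scored candidates per prompt (the function and field names are hypothetical, for illustration only):
```python
def build_tpo_triple(responses, scores):
    """Pick (gold, preferred, rejected) from one prompt's scored candidates.

    responses: list of candidate completions for a single prompt
    scores:    matching list of reward scores (higher is better)
    """
    ranked = sorted(zip(responses, scores), key=lambda rs: rs[1], reverse=True)
    gold = ranked[0][0]       # highest-scoring response
    preferred = ranked[3][0]  # fourth-best response (the "y4" in the model name)
    rejected = ranked[-1][0]  # lowest-scoring response
    return gold, preferred, rejected
```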
### Model Sources
- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681
## How to Get Started with the Model
```python
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"

generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    eos_token_id=generator.tokenizer.eos_token_id,  # Mistral's </s>; this tokenizer has no <end_of_turn> token
    max_new_tokens=200,
)
print(outputs[0]["generated_text"])
```
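With chat-style input as above, recent `transformers` versions return the full conversation in `outputs[0]["generated_text"]` as a list of role/content messages, so the assistant's reply is the last element; with a plain string prompt, `generated_text` is simply the generated string.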
## Training Details
### Training Data
We use [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) as the preference optimization dataset.
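A minimal way to pull the dataset from the Hub and inspect it before building TPO triples (assuming the usual `train` split; column names vary between releases, so check them rather than guessing):
```python
from datasets import load_dataset

# Preference-optimization data used for fine-tuning (train split assumed).
ds = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
print(ds.column_names)  # confirm which fields hold the candidate responses and scores
print(ds[0])
```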
#### Training Hyperparameters
The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).
## Technical Specifications
### Model Architecture and Objective
The model architecture is based on [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).
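The exact TPO loss (and the reward-margin term added by TPO-L) is specified in the preprint; the sketch below is only a rough illustration of the general idea of a triple-preference objective, combining a preference term over the preferred and rejected responses with a term that anchors the policy to the gold response. The function name, coefficients, and length-normalized log-probabilities are assumptions for illustration, not the paper's formulation.
```python
import torch.nn.functional as F

def triple_preference_loss_sketch(logp_gold, logp_pref, logp_rej, beta=2.0, alpha=1.0):
    # Illustrative only -- NOT the exact TPO loss; see https://arxiv.org/abs/2405.16681.
    # logp_* are (length-normalized) sequence log-probabilities under the policy.
    pref_term = -F.logsigmoid(beta * (logp_pref - logp_rej))  # push y_pref above y_rej
    gold_term = -alpha * logp_gold                            # keep the policy close to y_gold
    return (pref_term + gold_term).mean()
```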
#### Hardware
We used 8xA100 GPUs for model training.
## Citation
TPO paper:
```
@misc{saeidi2025triplepreferenceoptimizationachieving,
title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
year={2025},
eprint={2405.16681},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.16681},
}
```