---
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/mistral-instruct-ultrafeedback
model-index:
- name: tpo-alignment/Mistral-Instruct-7B-TPO-y4
  results: []
license: mit
---

# Mistral-Instruct-7B-TPO-y4 Model Card

TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).

## Model Details

### Model Description

We fine-tuned [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the fourth-best response as the preferred response, and the lowest-scoring response as the rejected response.
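
As a rough sketch of that selection rule (the `all_generated_responses` and `all_rm_scores` field names are assumptions about an UltraFeedback-style row with reward-model scores, not necessarily the dataset's exact schema):

```
def build_tpo_triple(example):
    """Pick (gold, preferred, rejected) from one scored example.

    The `all_generated_responses` / `all_rm_scores` field names are assumptions;
    adapt them to the dataset's actual schema. Assumes at least four candidates.
    """
    # Rank candidate responses from highest to lowest reward-model score.
    ranked = sorted(
        zip(example["all_generated_responses"], example["all_rm_scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "gold": ranked[0][0],       # highest-scoring response
        "preferred": ranked[3][0],  # fourth-best response ("y4")
        "rejected": ranked[-1][0],  # lowest-scoring response
    }
```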

- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** MIT
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Model Sources

- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681

## How to Get Started with the Model

```
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    max_new_tokens=200,
)
# With chat-style input, `generated_text` holds the full conversation;
# the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
```
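
The pipeline is the simplest path; the model can also be loaded directly with `AutoModelForCausalLM` and the tokenizer's chat template. A minimal sketch, mirroring the prompt and generation settings above:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What's the difference between llamas and alpacas?"}]
# Apply the Mistral-Instruct chat template and move the prompt to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=200)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```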

## Training Details

### Training Data

We use [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) as the preference optimization dataset.
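
The dataset can be inspected directly from the Hub with the `datasets` library. A minimal sketch; the `train` split name is an assumption, so check the dataset card for the exact splits and columns:

```
from datasets import load_dataset

# Load the preference-optimization data from the Hugging Face Hub.
# The "train" split name is an assumption; see the dataset card for the exact splits.
dataset = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
print(dataset)      # row count and column names
print(dataset[0])   # inspect the fields of a single example
```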

#### Training Hyperparameters

The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).

## Technical Specifications

### Model Architecture and Objective

The model architecture is based on [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).
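
TPO optimizes over all three responses (gold, preferred, rejected) in a single objective. Purely for intuition, the sketch below shows one way such a triple-based loss could be written, assuming it couples a DPO-style preference term on the preferred/rejected pair with a likelihood term anchoring the policy to the gold response; the exact TPO formulation, coefficients, and normalization are given in the preprint and the reference implementation:

```
import torch
import torch.nn.functional as F

def tpo_style_loss(policy_gold_logps, policy_pref_logps, policy_rej_logps,
                   ref_pref_logps, ref_rej_logps, beta=0.1, alpha=1.0):
    """Illustrative triple-preference loss over summed per-response log-probabilities."""
    # DPO-style margin between preferred and rejected responses,
    # measured relative to the reference model.
    margin = beta * ((policy_pref_logps - ref_pref_logps)
                     - (policy_rej_logps - ref_rej_logps))
    preference_loss = -F.logsigmoid(margin)
    # Likelihood (SFT-style) term that anchors the policy to the gold response.
    gold_nll = -policy_gold_logps
    return (preference_loss + alpha * gold_nll).mean()

# Example with dummy per-response log-probabilities (shape: [batch]).
b = torch.randn(4)
loss = tpo_style_loss(b, b + 0.5, b - 0.5, b, b)
```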

#### Hardware

We used 8xA100 GPUs for model training.

## Citation

TPO paper:

```
@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681},
}
```