---
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/mistral-instruct-ultrafeedback
model-index:
- name: tpo-alignment/Mistral-Instruct-7B-TPO-y4
  results: []
license: mit
---

# Mistral-Instruct-7B-TPO-y4 Model Card

TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).

## Model Details

### Model Description

We fine-tuned [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the fourth-best response as the preferred response, and the lowest-scoring response as the rejected response.
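
As a rough sketch of that selection rule (the `all_generated_responses` and `all_rm_scores` field names are assumptions about an UltraFeedback-style row with reward-model scores, not necessarily the dataset's exact schema):

```
def build_tpo_triple(example):
    """Pick (gold, preferred, rejected) from one scored example.

    The `all_generated_responses` / `all_rm_scores` field names are assumptions;
    adapt them to the dataset's actual schema. Assumes at least four candidates.
    """
    # Rank candidate responses from highest to lowest reward-model score.
    ranked = sorted(
        zip(example["all_generated_responses"], example["all_rm_scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "gold": ranked[0][0],       # highest-scoring response
        "preferred": ranked[3][0],  # fourth-best response ("y4")
        "rejected": ranked[-1][0],  # lowest-scoring response
    }
```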

- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** MIT
- **Finetuned from model:** [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Model Sources

- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681

## How to Get Started with the Model

```
import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    max_new_tokens=200,
)
# With chat-style input, `generated_text` holds the full conversation;
# the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
```
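
The pipeline is the simplest path; the model can also be loaded directly with `AutoModelForCausalLM` and the tokenizer's chat template. A minimal sketch, mirroring the prompt and generation settings above:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What's the difference between llamas and alpacas?"}]
# Apply the Mistral-Instruct chat template and move the prompt to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=200)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```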

## Training Details

### Training Data

We use [princeton-nlp/mistral-instruct-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback) as the preference optimization dataset.
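
The dataset can be inspected directly from the Hub with the `datasets` library. A minimal sketch; the `train` split name is an assumption, so check the dataset card for the exact splits and columns:

```
from datasets import load_dataset

# Load the preference-optimization data from the Hugging Face Hub.
# The "train" split name is an assumption; see the dataset card for the exact splits.
dataset = load_dataset("princeton-nlp/mistral-instruct-ultrafeedback", split="train")
print(dataset)      # row count and column names
print(dataset[0])   # inspect the fields of a single example
```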

#### Training Hyperparameters

The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).

## Technical Specifications

### Model Architecture and Objective

The model architecture is based on [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).
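
TPO optimizes over all three responses (gold, preferred, rejected) in a single objective. Purely for intuition, the sketch below shows one way such a triple-based loss could be written, assuming it couples a DPO-style preference term on the preferred/rejected pair with a likelihood term anchoring the policy to the gold response; the exact TPO formulation, coefficients, and normalization are given in the preprint and the reference implementation:

```
import torch
import torch.nn.functional as F

def tpo_style_loss(policy_gold_logps, policy_pref_logps, policy_rej_logps,
                   ref_pref_logps, ref_rej_logps, beta=0.1, alpha=1.0):
    """Illustrative triple-preference loss over summed per-response log-probabilities."""
    # DPO-style margin between preferred and rejected responses,
    # measured relative to the reference model.
    margin = beta * ((policy_pref_logps - ref_pref_logps)
                     - (policy_rej_logps - ref_rej_logps))
    preference_loss = -F.logsigmoid(margin)
    # Likelihood (SFT-style) term that anchors the policy to the gold response.
    gold_nll = -policy_gold_logps
    return (preference_loss + alpha * gold_nll).mean()

# Example with dummy per-response log-probabilities (shape: [batch]).
b = torch.randn(4)
loss = tpo_style_loss(b, b + 0.5, b - 0.5, b, b)
```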

#### Hardware

We used 8xA100 GPUs for model training.

## Citation

TPO paper:

```
@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681},
}
```