arxiv:2403.07691

ORPO: Monolithic Preference Optimization without Reference Model

Published on Mar 12

Abstract

While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval 2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-alpha (7B) and Mistral-ORPO-beta (7B).
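
The abstract describes adding a small odds-ratio penalty to the SFT objective. A minimal sketch of that idea follows, assuming the per-example objective has the form L_SFT + λ · L_OR with L_OR = −log σ(log(odds(y_w|x) / odds(y_l|x))) and odds(y|x) = P(y|x) / (1 − P(y|x)). The function names, the scalar inputs, and the default weight `lam` are illustrative assumptions, not the authors' released code; `p_chosen` and `p_rejected` stand in for sequence likelihoods in (0, 1).

```python
import math


def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def odds_ratio_loss(p_chosen: float, p_rejected: float) -> float:
    """-log sigmoid(log odds ratio) between favored and disfavored responses."""
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -log_sigmoid(log_odds_ratio)


def orpo_style_loss(p_chosen: float, p_rejected: float, lam: float = 0.1) -> float:
    """SFT negative log-likelihood on the favored response plus a weighted penalty.

    lam is a hypothetical weight chosen only for illustration.
    """
    sft_nll = -math.log(p_chosen)
    return sft_nll + lam * odds_ratio_loss(p_chosen, p_rejected)


if __name__ == "__main__":
    # The penalty shrinks as the favored response becomes relatively more likely.
    for p_w, p_l in [(0.5, 0.5), (0.6, 0.2), (0.2, 0.6)]:
        print(p_w, p_l, round(odds_ratio_loss(p_w, p_l), 4))
```

When both responses are equally likely the log odds ratio is zero and the penalty equals log 2; it falls toward zero as the favored response gains odds over the disfavored one.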

Community

Streamlining the process, bit by bit. Definitely want to try this method out!

Hi @JW17, @nlee-208 and @j6mes, first of all, congrats on ORPO, it's great! I'm really enjoying reading it; there is so much content and so many new things to learn.

Just wanted to quickly report a typo I found this morning while re-reading Section 4.2 of the paper, see it highlighted below.

IMG_0120.jpeg

Thanks in advance and congrats again!

P.S. Loving Hugging Face paper pages for this, engaging with the authors is so easy! 🤗

Paper author, edited Mar 24

Thank you for reporting it! We will fix it for the next version of the paper 😀
(I agree that the HF paper page is awesome 👍)

Your recent study caught my attention with its impressive results. The findings are noteworthy and add valuable insights to the field. I'm curious to learn more about your research. Could you please elaborate further?

In Section 4.3 Gradient of ORPO, you mention:

Specifically, 1 − P(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood P(y|x) is low.

However, I believe that if P(y|x) becomes low, then 1-P(y|x) would be high, resulting in a low value for 1/(1-P(y|x)). Therefore, it seems to me that 1-P(y|x) in the denominator does not amplify the gradient when P(y|x) is low.

I apologize if I have misunderstood your work. I would greatly appreciate your clarification on this matter.
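
The arithmetic behind this question can be checked directly. The sketch below evaluates only the isolated factor 1/(1 − P(y|x)) for a few probabilities; it is not the full ORPO gradient, and the function name `inverse_complement` is made up for illustration.

```python
def inverse_complement(p: float) -> float:
    """Return 1 / (1 - p) for a probability p in [0, 1)."""
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Evaluate the factor from low to high likelihood values.
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(f"P = {p:<4}  1/(1-P) = {inverse_complement(p):.2f}")
```

Taken on its own, this factor stays close to 1 when P(y|x) is low and grows without bound as P(y|x) approaches 1, which is the behavior the question describes.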


Hi toraise :)
I have the same question, did you find out anything?

It's interesting to see how much the fine-tuned model deviates from the original one, as you don't implicitly use KL divergence.

Models citing this paper 110

Datasets citing this paper 0

Spaces citing this paper 71

Collections including this paper 22