Preference Alignment

#6 opened by tanliboy

The performance of the math model is amazing, and I’m particularly curious about the use of the reward model for preference alignment. Could you provide more details on its training and usage? Specifically, how is the reward model trained from the SFT model, and what is the scale of the training data? Additionally, what drove the decision to use ORPO over other methods like DPO/APO/IPO/RLAIF? Lastly, could you elaborate on the performance improvements observed between the ORPO-tuned model and the SFT model?
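To be concrete about what I mean by the ORPO-vs-DPO question: my rough understanding of the ORPO objective is sketched below (variable names and the λ weight are purely illustrative, not taken from your recipe). Unlike DPO, it folds the preference term into the SFT loss and needs no frozen reference model.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    """Illustrative ORPO objective (Hong et al., 2024), not the Qwen2.5-Math recipe.
    chosen_logps / rejected_logps: length-normalized sequence log-probs under the
    policy being trained; chosen_nll: standard SFT loss on the chosen responses."""
    # odds(y|x) = p / (1 - p), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # odds-ratio term: push the odds of the chosen response above the rejected one
    odds_ratio_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return chosen_nll - lam * odds_ratio_term.mean()
```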

Oh, I found the technical report (https://arxiv.org/pdf/2409.12122v1).

Zhenru (Qwen org)

Yes, you can find the implementation details in the technical report.

Zhenru changed discussion status to closed

Thank you, @Zhenru!
I want to test CoT without the Python tool. Is there a flag I can use to turn off the Python tool during model inference?
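For reference, my current understanding from the model card is that there is no explicit flag and the mode is selected by the prompt: the plain "reason step by step" system prompt keeps the model in CoT, while the tool-integrated prompt triggers Python code generation. Here is a minimal sketch of what I'm running (checkpoint name and system prompt assumed from the public model card; please correct me if there is a better switch):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# CoT-only: a system prompt that never mentions tools, so the model answers in
# plain chain-of-thought instead of emitting Python blocks for execution.
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Find the value of x that satisfies 2x + 3 = 11."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```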

If I want to leverage the reasoning capabilities of both the reward model and the instruct model to fine-tune for other reasoning tasks, such as step-by-step planning beyond math problems, what would you suggest?
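Concretely, I am imagining something like best-of-N / rejection sampling: sample several candidate step-by-step plans from the instruct model, keep the one the reward model scores highest, and optionally distill the kept traces back into the model via SFT. A rough sketch (both helper functions are placeholders, not real Qwen APIs):

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],   # placeholder: sample n plans from the instruct model
    score_with_reward_model: Callable[[str, str], float],   # placeholder: score (prompt, plan) with the reward model
    n: int = 8,
) -> str:
    """Pick the candidate plan the reward model scores highest (best-of-N selection)."""
    candidates = generate_candidates(prompt, n)              # sampled with temperature > 0 for diversity
    scores = [score_with_reward_model(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```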
