Preference Alignment
The performance of the math model is amazing, and I’m particularly curious about the use of the reward model for preference alignment. Could you provide more details on its training and usage? Specifically, how is the reward model trained from the SFT model, and what is the scale of the training data? Additionally, what drove the decision to use ORPO over other methods such as DPO/APO/IPO/RLAIF? Lastly, could you elaborate on the performance gains of the ORPO-tuned model over the SFT model?
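For anyone else reading along: here is a minimal sketch of what ORPO fine-tuning from an SFT checkpoint typically looks like with Hugging Face TRL. The model and dataset names are placeholders (not the ones from the report), the hyperparameters are illustrative, and keyword names can differ slightly across TRL versions.

```python
# Illustrative sketch only: ORPO fine-tuning starting from an SFT checkpoint.
# Model/dataset names and hyperparameters are placeholders, not the report's.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "your-org/your-sft-model"  # placeholder: the SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ORPO expects a preference dataset with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("your-org/your-preference-data", split="train")  # placeholder

config = ORPOConfig(
    output_dir="orpo-tuned-model",
    beta=0.1,                      # weight of the odds-ratio (preference) term
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

One practical point in ORPO's favor: it folds the preference term into the SFT loss, so unlike DPO it does not need a separate frozen reference model in memory during training.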
Oh, I found the technical report (https://arxiv.org/pdf/2409.12122v1).
Yes, you can find the implementation details in the technical report.
If I want to leverage the reasoning capabilities of both the reward model and the instruct model to fine-tune for other reasoning tasks, such as step-by-step planning beyond math problems, what would you suggest?
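Not speaking for the authors, but one common recipe for this is rejection sampling: sample several candidate plans from the instruct model, score them with the reward model, keep the top-scoring ones as SFT data, and use best/worst pairs to seed a preference dataset for the new domain. A rough sketch below, assuming the reward model is exposed as a standard single-logit sequence classifier and using placeholder model names; real reward models often expect their own chat template, so adjust the input formatting accordingly.

```python
# Illustrative sketch only: best-of-N sampling with a reward model to build
# training data for a planning task. Model names are placeholders.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

gen_name = "your-org/your-instruct-model"  # placeholder
rm_name = "your-org/your-reward-model"     # placeholder

gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen_model = AutoModelForCausalLM.from_pretrained(gen_name, torch_dtype=torch.bfloat16)

rm_tok = AutoTokenizer.from_pretrained(rm_name)
# Assumes the reward model is a single-logit sequence classifier.
rm_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

def generate_plans(prompt: str, n: int = 8) -> list[str]:
    """Sample n candidate step-by-step plans from the instruct model."""
    inputs = gen_tok(prompt, return_tensors="pt").to(gen_model.device)
    outputs = gen_model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=512,
        num_return_sequences=n,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [gen_tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def score(prompt: str, plan: str) -> float:
    """Score a (prompt, plan) pair with the reward model."""
    inputs = rm_tok(prompt, plan, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm_model(**inputs).logits[0, 0].item()

prompt = "Plan, step by step, how to migrate a monolithic service to microservices."
candidates = generate_plans(prompt)
ranked = sorted(candidates, key=lambda plan: score(prompt, plan), reverse=True)
best, worst = ranked[0], ranked[-1]
# `best` can go into an SFT set; (best, worst) pairs can seed a preference
# dataset for another round of ORPO/DPO on the planning domain.
```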