Preference Alignment
The performance of the math model is amazing, and I’m particularly curious about the use of the reward model for preference alignment. Could you provide more details on its training and usage? Specifically, how is the reward model trained from the SFT model, and what is the scale of the training data? Additionally, what drove the decision to use ORPO over other methods such as DPO/APO/IPO/RLAIF? Lastly, could you elaborate on the performance gains of the ORPO-tuned model over the SFT model?
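For anyone else reading along: here is a minimal sketch of what ORPO fine-tuning from an SFT checkpoint typically looks like with Hugging Face TRL. The model and dataset names are placeholders (not the ones from the report), the hyperparameters are illustrative, and keyword names can differ slightly across TRL versions.

```python
# Illustrative sketch only: ORPO fine-tuning starting from an SFT checkpoint.
# Model/dataset names and hyperparameters are placeholders, not the report's.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "your-org/your-sft-model"  # placeholder: the SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ORPO expects a preference dataset with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("your-org/your-preference-data", split="train")  # placeholder

config = ORPOConfig(
    output_dir="orpo-tuned-model",
    beta=0.1,                      # weight of the odds-ratio (preference) term
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

One practical point in ORPO's favor: it folds the preference term into the SFT loss, so unlike DPO it does not need a separate frozen reference model in memory during training.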
Oh, I found the technical report (https://arxiv.org/pdf/2409.12122v1).
Yes, you can find the implementation details in the technical report.
If I want to leverage the reasoning capabilities of both the reward model and the instruct model to fine-tune for other reasoning tasks, such as step-by-step planning beyond math problems, what would you suggest?
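Not speaking for the authors, but one common recipe for this is rejection sampling: sample several candidate plans from the instruct model, score them with the reward model, keep the top-scoring ones as SFT data, and use best/worst pairs to seed a preference dataset for the new domain. A rough sketch below, assuming the reward model is exposed as a standard single-logit sequence classifier and using placeholder model names; real reward models often expect their own chat template, so adjust the input formatting accordingly.

```python
# Illustrative sketch only: best-of-N sampling with a reward model to build
# training data for a planning task. Model names are placeholders.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

gen_name = "your-org/your-instruct-model"  # placeholder
rm_name = "your-org/your-reward-model"     # placeholder

gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen_model = AutoModelForCausalLM.from_pretrained(gen_name, torch_dtype=torch.bfloat16)

rm_tok = AutoTokenizer.from_pretrained(rm_name)
# Assumes the reward model is a single-logit sequence classifier.
rm_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

def generate_plans(prompt: str, n: int = 8) -> list[str]:
    """Sample n candidate step-by-step plans from the instruct model."""
    inputs = gen_tok(prompt, return_tensors="pt").to(gen_model.device)
    outputs = gen_model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=512,
        num_return_sequences=n,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [gen_tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def score(prompt: str, plan: str) -> float:
    """Score a (prompt, plan) pair with the reward model."""
    inputs = rm_tok(prompt, plan, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm_model(**inputs).logits[0, 0].item()

prompt = "Plan, step by step, how to migrate a monolithic service to microservices."
candidates = generate_plans(prompt)
ranked = sorted(candidates, key=lambda plan: score(prompt, plan), reverse=True)
best, worst = ranked[0], ranked[-1]
# `best` can go into an SFT set; (best, worst) pairs can seed a preference
# dataset for another round of ORPO/DPO on the planning domain.
```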