- Bootstrapping Language Models with DPO Implicit Rewards — arXiv:2406.09760
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM — arXiv:2406.12168
- WPO: Enhancing RLHF with Weighted Preference Optimization — arXiv:2406.11827
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs — arXiv:2406.18629