- Understanding and Diagnosing Deep Reinforcement Learning (arXiv:2406.16979, published 16 days ago, 8 upvotes)
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences (arXiv:2404.03715, published Apr 4, 58 upvotes)
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning (arXiv:2407.00617, published 9 days ago, 6 upvotes)
- Offline Regularised Reinforcement Learning for Large Language Models Alignment (arXiv:2405.19107, published May 29, 12 upvotes)
- DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging (arXiv:2407.01470, published 8 days ago, 5 upvotes)
- Understanding the performance gap between online and offline alignment algorithms (arXiv:2405.08448, published May 14, 14 upvotes)
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF (arXiv:2405.19320, published May 29, 9 upvotes)
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143, published May 20, 33 upvotes)
- WPO: Enhancing RLHF with Weighted Preference Optimization (arXiv:2406.11827, published 22 days ago, 13 upvotes)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (arXiv:2406.20095, published 11 days ago, 17 upvotes)
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (arXiv:2406.11896, published 25 days ago, 17 upvotes)
- Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning (arXiv:2406.00392, published Jun 1, 12 upvotes)