Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning • Paper • 2407.00617 • Published 6 days ago