Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22 • 11