arXiv:2502.14768

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Published on Feb 20 · Submitted by akhaliq on Feb 21
Abstract

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make several key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it generalizes to the challenging math benchmarks AIME and AMC.
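The "stringent format reward" described above lends itself to a simple rule-based implementation: check that the output follows the required template, then verify the extracted answer against the ground truth. The sketch below is our own illustration, not the authors' code; the `<think>`/`<answer>` tag names follow the DeepSeek-R1 convention, and the specific reward values and penalty scheme are assumptions.

```python
import re

def rule_based_reward(output: str, gold_answer: str) -> float:
    """Sketch of a rule-based reward: strict format check plus exact-answer
    verification. Tag names and reward values are illustrative assumptions,
    not taken from the paper."""
    # Require exactly one <think>...</think> block followed by exactly one
    # <answer>...</answer> block, so the model cannot shortcut the template
    # (e.g., skip the reasoning step or emit multiple answers).
    m = re.fullmatch(
        r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*",
        output,
        flags=re.DOTALL,
    )
    if m is None:
        return -2.0  # format penalty: output took a shortcut
    answer = m.group(2).strip()
    return 2.0 if answer == gold_answer.strip() else -1.0
```

Because both checks are deterministic string operations, this kind of reward needs no learned reward model, which is what makes answer verification on synthetic logic puzzles straightforward.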

Community


Generating a 50,000-point lambda calculus dataset at this very moment.

We made a deep-dive video for this paper, "DeepSeek R1 Reproduced & Upgraded!": https://www.youtube.com/watch?v=IsfG3r1car0

Thanks for the amazing work! May I ask a quick question about the global batch size and total training steps? The paper mentions that the training set has about 5K samples, with a training batch size of 8 and a rollout of 8. How do you get 3,600 training steps with this setup? Was additional gradient accumulation used? Many thanks.
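For context, a back-of-the-envelope calculation (assuming each optimizer step consumes one batch of 8 unique prompts; the paper may define "step" differently, so this is only a sanity check):

```python
# All values below come from the question above; the per-step accounting
# is an assumption, not confirmed by the paper.
dataset_size = 5_000   # ~5K logic puzzles
batch_size = 8         # unique prompts per training step
rollouts = 8           # responses sampled per prompt
total_steps = 3_600    # reported training steps

steps_per_epoch = dataset_size / batch_size            # 625.0
implied_epochs = total_steps / steps_per_epoch         # 5.76
total_rollouts = total_steps * batch_size * rollouts   # 230,400 sampled responses

print(f"{steps_per_epoch=} {implied_epochs=} {total_rollouts=}")
```

Under these assumptions, 3,600 steps corresponds to roughly 5.8 passes over the 5K set, so multiple epochs alone could account for the step count without any gradient accumulation.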

