[Paper review] Small Models Struggle to Learn from Strong Reasoners

by lewtun

I recently read the very nice paper Small Models Struggle to Learn from Strong Reasoners and thought I'd share my notes:

tl;dr

  • they compare the performance of models SFTed in two ways:
    • long CoT vs short CoT
    • strong teacher vs weak teacher
  • focuses on math (surprise, surprise)
  • the main message is in Tables 1 & 2:
    • models <= 3B fare worse with long CoT than with short CoT
    • models <= 3B tend to fare worse with a strong teacher than with a weak one
  • but you can get the best performance by blending either data or teachers (see the sketch after this list)!
    • a mix of long and short CoT > short CoT alone
    • a mix of strong and weak teachers > a single teacher
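
For the data-blending point, here's a minimal sketch with the `datasets` library; the dataset names and the 50/50 ratio are placeholders of mine, not values from the paper:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical dataset names -- swap in whatever long/short CoT SFT sets you have.
long_cot = load_dataset("my-org/math-long-cot", split="train")
short_cot = load_dataset("my-org/math-short-cot", split="train")

# The paper's takeaway is that a blend beats either source alone; the 50/50 split
# here is an illustrative assumption, not a ratio reported in the paper.
mixed = interleave_datasets(
    [long_cot, short_cot],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```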

bonus insights

  • the long vs short CoT gap is much smaller for domain-specific models (e.g. Qwen Math gets a consistent boost from long CoT)
  • base models fare worse than instruct ones on long CoT data

questions

  • are the conclusions robust to the choice of eval?
  • the gains mostly come from the AMC and AIME24 benchmarks. These are tiny evals of ~30 problems => high variance from getting just 1 more problem right (see the quick check after this list)!
  • while evaluating the DeepSeek distilled models, I noticed that greedy decoding would almost always fail to close the reasoning block within the token budget
    • do the conclusions hold once you add online methods to the mix?
  • can RL "fix" the learnability gap?
  • does GKD also solve the teacher gap? (sketch of the GKD objective below)
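
On the variance point, a quick back-of-the-envelope check; the 40% accuracy and the i.i.d. binomial model are my assumptions for illustration, not numbers from the paper:

```python
import math

n = 30    # roughly the size of AIME24 / AMC-style evals
p = 0.4   # assumed accuracy, purely for illustration

one_problem_swing = 100 / n                 # score change from solving one extra problem
std_err = 100 * math.sqrt(p * (1 - p) / n)  # binomial standard error of the score

print(f"one extra correct answer: +{one_problem_swing:.1f} points")
print(f"standard error at {p:.0%} accuracy: ±{std_err:.1f} points")
# ~3.3 points per problem and a ~9-point standard error, so small deltas
# between runs are easily within noise.
```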
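
And for context on the last question: GKD (Agarwal et al., 2023) distils on sequences sampled from the student and matches per-token distributions with a generalized Jensen-Shannon divergence. A minimal sketch of that divergence is below; the beta convention and the per-token averaging are my assumptions, so double-check against the paper or TRL's GKDTrainer before relying on it:

```python
import math

import torch
import torch.nn.functional as F

def generalized_jsd(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    beta: float = 0.5) -> torch.Tensor:
    """Generalized JSD between per-token distributions, shape (batch, seq, vocab).

    beta in (0, 1) interpolates between forward- and reverse-KL-like behaviour;
    beta=0.5 gives the symmetric JSD. Exact conventions vary across write-ups.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # log of the mixture m = beta * p_student + (1 - beta) * p_teacher
    mixture_logp = torch.logsumexp(
        torch.stack([student_logp + math.log(beta),
                     teacher_logp + math.log(1.0 - beta)]),
        dim=0,
    )

    # KL(p_student || m) and KL(p_teacher || m), summed over all tokens
    kl_student = F.kl_div(mixture_logp, student_logp, log_target=True, reduction="sum")
    kl_teacher = F.kl_div(mixture_logp, teacher_logp, log_target=True, reduction="sum")

    n_tokens = student_logits.shape[0] * student_logits.shape[1]
    return (beta * kl_student + (1.0 - beta) * kl_teacher) / n_tokens
```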

Overall a good paper and one to keep in mind as we try to train smol reasoners!
