[Paper review] Small Models Struggle to Learn from Strong Reasoners

by lewtun

I recently read the very nice paper Small Models Struggle to Learn from Strong Reasoners and thought I'd share my notes:

tl;dr

  • they compare the performance of models SFTed in two ways:
    • long CoT vs short CoT
    • strong teacher vs weak teacher
  • focuses on math (surprise, surprise)
  • the main message is in Tables 1 & 2:
    • models <= 3B fare worse with long CoT than with short CoT
    • models <= 3B tend to fare worse with a strong teacher than with a weak one
  • but you can get the best performance by blending either data or teachers (see the sketch after this list)!
    • a mix of long and short CoT > short CoT alone
    • a mix of strong and weak teachers > a single teacher
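
For the data-blending point, here's a minimal sketch with the `datasets` library; the dataset names and the 50/50 ratio are placeholders of mine, not values from the paper:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical dataset names -- swap in whatever long/short CoT SFT sets you have.
long_cot = load_dataset("my-org/math-long-cot", split="train")
short_cot = load_dataset("my-org/math-short-cot", split="train")

# The paper's takeaway is that a blend beats either source alone; the 50/50 split
# here is an illustrative assumption, not a ratio reported in the paper.
mixed = interleave_datasets(
    [long_cot, short_cot],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```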

bonus insights

  • the long vs short CoT gap is much smaller for domain-specific models (e.g. Qwen Math gets a consistent boost from long CoT)
  • base models fare worse than instruct ones on long CoT data

questions

  • are the conclusions robust to the choice of eval?
  • the gains mostly come from the AMC and AIME24 benchmarks. These are tiny evals of ~30 problems => high variance from getting just 1 more problem right (see the quick check after this list)!
  • while evaluating the DeepSeek distilled models, I noticed that greedy decoding would almost always fail to close the reasoning block within the token budget
    • do the conclusions hold once you add online methods to the mix?
  • can RL "fix" the learnability gap?
  • does GKD also solve the teacher gap? (sketch of the GKD objective below)
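
On the variance point, a quick back-of-the-envelope check; the 40% accuracy and the i.i.d. binomial model are my assumptions for illustration, not numbers from the paper:

```python
import math

n = 30    # roughly the size of AIME24 / AMC-style evals
p = 0.4   # assumed accuracy, purely for illustration

one_problem_swing = 100 / n                 # score change from solving one extra problem
std_err = 100 * math.sqrt(p * (1 - p) / n)  # binomial standard error of the score

print(f"one extra correct answer: +{one_problem_swing:.1f} points")
print(f"standard error at {p:.0%} accuracy: ±{std_err:.1f} points")
# ~3.3 points per problem and a ~9-point standard error, so small deltas
# between runs are easily within noise.
```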
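
And for context on the last question: GKD (Agarwal et al., 2023) distils on sequences sampled from the student and matches per-token distributions with a generalized Jensen-Shannon divergence. A minimal sketch of that divergence is below; the beta convention and the per-token averaging are my assumptions, so double-check against the paper or TRL's GKDTrainer before relying on it:

```python
import math

import torch
import torch.nn.functional as F

def generalized_jsd(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    beta: float = 0.5) -> torch.Tensor:
    """Generalized JSD between per-token distributions, shape (batch, seq, vocab).

    beta in (0, 1) interpolates between forward- and reverse-KL-like behaviour;
    beta=0.5 gives the symmetric JSD. Exact conventions vary across write-ups.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # log of the mixture m = beta * p_student + (1 - beta) * p_teacher
    mixture_logp = torch.logsumexp(
        torch.stack([student_logp + math.log(beta),
                     teacher_logp + math.log(1.0 - beta)]),
        dim=0,
    )

    # KL(p_student || m) and KL(p_teacher || m), summed over all tokens
    kl_student = F.kl_div(mixture_logp, student_logp, log_target=True, reduction="sum")
    kl_teacher = F.kl_div(mixture_logp, teacher_logp, log_target=True, reduction="sum")

    n_tokens = student_logits.shape[0] * student_logits.shape[1]
    return (beta * kl_student + (1.0 - beta) * kl_teacher) / n_tokens
```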

Overall a good paper and one to keep in mind as we try to train smol reasoners!
