[Paper review] Small Models Struggle to Learn from Strong Reasoners
#19 by lewtun
I recently read the very nice paper Small Models Struggle to Learn from Strong Reasoners and thought I'd share my notes:
tl;dr
- they compare the performance of models SFTed in two ways:
- long CoT vs short CoT
- strong teacher vs weak teacher
- the paper focuses on math (surprise, surprise)
- the main messages are in Tables 1 & 2:
- models <= 3B fare worse with long CoT than short CoT
- models <= 3B tend to fare worse with a strong teacher than with a weak one
- but you can get the best performance by blending either data or teachers (see the mixing sketch after this list)!
- mix long and short CoT > short CoT
- mix strong and weak teacher > single teacher
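As a concrete illustration of the blending point, here is a minimal sketch of mixing long- and short-CoT SFT data with the `datasets` library. The dataset names and the 50/50 ratio are made up for illustration; the paper doesn't prescribe a specific recipe:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical dataset names -- substitute your own long/short CoT SFT sets
long_cot = load_dataset("my-org/math-long-cot", split="train")
short_cot = load_dataset("my-org/math-short-cot", split="train")

# Sample from both sources; the 50/50 ratio is an assumption, not from the paper
mixed = interleave_datasets(
    [long_cot, short_cot],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)

# `mixed` can now be passed to an SFT trainer as usual
```

The same pattern works for the teacher mix: generate completions from two teachers and interleave the resulting datasets.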
bonus insights
- the long vs short CoT gap is much smaller for domain-specific models (e.g. Qwen Math gets a consistent boost from long CoT)
- base models fare worse than instruct ones for long CoT data
questions
- are the conclusions robust to the choice of eval?
- the gains are mostly coming from the AMC and AIME24 benchmarks. These are tiny evals of ~30 problems => high variance from getting just 1 more problem right (see the quick arithmetic after this list)!
- while evaluating the DeepSeek distilled models, I noticed that greedy decoding would almost always fail to close the reasoning block within the token budget (see the decoding sketch after this list)
- do the conclusions hold once you add online methods to the mix?
- can RL "fix" the learnability gap?
- does GKD also solve the teacher gap?
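On the eval-size question, a quick back-of-the-envelope check of how noisy a ~30-problem benchmark is. The numbers below (n = 30, accuracy around 50%) are assumptions for illustration, not figures from the paper:

```python
import math

n = 30   # roughly the size of the AMC / AIME24 evals mentioned above
p = 0.5  # assumed accuracy, just for illustration

# One extra correct answer moves accuracy by 1/n
print(f"one problem = {100 / n:.1f} accuracy points")  # ~3.3 points

# Binomial standard error of the accuracy estimate
se = math.sqrt(p * (1 - p) / n)
print(f"std. error at p={p}: {100 * se:.1f} points")   # ~9.1 points
```

So a single extra solved problem shifts the score by several points, and the standard error is larger than many of the reported gaps.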
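And for the greedy-decoding observation, a sketch of how one could check whether a model closes its reasoning block within a fixed token budget. The model ID, budget, prompt, and the `</think>` delimiter are assumptions about the DeepSeek-R1 distills, not something taken from the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model and budget -- adjust to whatever you are evaluating
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
budget = 4096

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 23?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding under a hard token budget
out = model.generate(**inputs, max_new_tokens=budget, do_sample=False)
text = tok.decode(out[0][inputs["input_ids"].shape[1]:])

# If the closing tag never appears, the reasoning block was cut off by the budget
print("closed reasoning block:", "</think>" in text)
```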
Overall a good paper and one to keep in mind as we try to train smol reasoners!