Performance on MATH dataset?
Hi, thanks for the LLM! I would appreciate knowing the MATH performance of the SmolLM2 series (currently only GSM8K seems to be reported).
Hi, HuggingFaceTB/SmolLM2-1.7B-Instruct scores 16.72 on MATH (4-shot).
@loubnabnl Hi, thank you very much! By the way, it seems that Llama-3.2-1B scores 30.6 on MATH and Qwen2.5-1.5B scores 55.2. I wonder whether Hugging Face will release models that are stronger in math in the future?
Evaluation setups can differ. In ours (which we'll share soon), Llama-3.2-1B-Instruct scores 6.48 on MATH and Qwen2.5-1.5B-Instruct scores 31.07, so the model is already good at math for 1B-scale models, and we will continue to improve it in the next iterations.
Thank you! That's interesting: I personally reproduced zero-shot CoT Llama-3.2-1B at 27.8. Looking forward to your evaluation setups!
Looking forward to your evaluation setups! +1
UPD: the code is merged into `smollm/evaluation`.
The MATH task will likely be added to mainline lighteval, but in the meantime you can use the task code in `smollm/evaluation/tasks.py`.
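For reference, a custom task in a lighteval `--custom_tasks` file is a `LightevalTaskConfig` exported through a `TASKS_TABLE` list. Below is a minimal sketch of what such a MATH entry could look like; the dataset id, prompt wording, metric choice, and generation settings here are my assumptions for illustration, and the merged code in `smollm/evaluation/tasks.py` is the authoritative version.

```python
# Sketch of a custom MATH task for lighteval's --custom_tasks.
# NOTE: dataset id, prompt wording, metric, and generation settings are
# assumptions; see smollm/evaluation/tasks.py for the actual task code.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def math_prompt(line, task_name: str = None):
    # One generative example: the problem text as the query,
    # the reference solution as the single gold target.
    return Doc(
        task_name=task_name,
        query=f"Problem: {line['problem']}\nAnswer:",
        choices=[line["solution"]],
        gold_index=0,
    )


math_task = LightevalTaskConfig(
    name="math",  # matched by the "custom|math|4|1" task spec on the CLI
    prompt_function=math_prompt,
    suite=["custom"],
    hf_repo="lighteval/MATH",  # assumed dataset id (Hendrycks et al. MATH)
    hf_subset="all",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    metric=[Metrics.quasi_exact_match_math],  # match after math normalization
    generation_size=1024,
    stop_sequence=["\nProblem:"],
)

# lighteval discovers tasks exported through TASKS_TABLE
TASKS_TABLE = [math_task]
```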
And run it with:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048" \
    --custom_tasks "tasks.py" --tasks "custom|math|4|1" --use_chat_template --output_dir "./evals" --save_details
```
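For context, the task spec `custom|math|4|1` follows lighteval's `suite|task|num_fewshot|truncate` format: `custom` is the suite declared in `tasks.py`, `math` is the task name, `4` is the number of few-shot examples, and the trailing `1` lets lighteval automatically reduce the few-shot count if the prompt would exceed the model's context length. With `--output_dir "./evals" --save_details`, the aggregated scores and per-sample generations are written under that directory.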