Performance on the MATH dataset?

#3
by fzyzcjy - opened

Hi, thanks for the LLM! I would appreciate it if I could know the MATH performance of the SmolLM2 series (currently it seems only GSM8K is reported).

Hugging Face TB Research org

Hi, HuggingFaceTB/SmolLM2-1.7B-Instruct scores 16.72 on MATH (4-shot).

@loubnabnl Hi, thank you very much! By the way, it seems that Llama-3.2-1B scores 30.6 on MATH and Qwen2.5-1.5B scores 55.2. I therefore wonder whether Hugging Face will create models that are stronger in math in the future?

Hugging Face TB Research org
edited 24 days ago

Evaluation setups can differ. In ours (which we'll share soon), Llama3.2-1B-Instruct scores 6.48 on MATH and Qwen2.5-1.5B-Instruct scores 31.07, so the model is already good at math for 1B-scale models, and we will continue to improve it in the next iterations.

Thank you! That's interesting - I personally reproduced zero-shot CoT Llama-3.2-1B at 27.8, etc. Looking forward to your evaluation setup!

Looking forward to your evaluation setups! +1

Hugging Face TB Research org
edited 3 days ago

Update: the code has been merged into smollm/evaluation.

The MATH task will likely be updated in mainline lighteval, but in the meantime you can add the task code to smollm/evaluation/tasks.py.
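For reference, here is a minimal sketch of what such a task definition can look like in lighteval's custom-task format. This is not the exact code from smollm/evaluation/tasks.py: the dataset repo, column names, and metric below are assumptions, and depending on your lighteval version prompt_function may need to be a callable or a function name, so check the merged code for the authoritative version.

# Hypothetical sketch of a custom MATH task for lighteval; not the exact
# code from smollm/evaluation/tasks.py. Dataset repo, column names, and
# metric choice are assumptions -- adjust to your lighteval version.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def math_prompt(line, task_name: str = None) -> Doc:
    # Assumes each dataset row has "problem" and "solution" columns.
    return Doc(
        task_name=task_name,
        query=f"Problem:\n{line['problem']}\n\nSolution:",
        choices=[line["solution"]],
        gold_index=0,
    )


math_task = LightevalTaskConfig(
    name="math",                              # referenced as custom|math|4|1 below
    suite=["custom"],
    prompt_function=math_prompt,
    hf_repo="lighteval/MATH",                 # assumed dataset location
    hf_subset="all",
    evaluation_splits=["test"],
    few_shots_split="train",
    generation_size=1024,
    metric=[Metrics.quasi_exact_match_math],  # assumed metric name
    stop_sequence=["\n\n"],
)

# lighteval discovers custom tasks through this module-level list.
TASKS_TABLE = [math_task]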

And run it with

lighteval accelerate \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048" \
  --custom_tasks "tasks.py" --tasks "custom|math|4|1" --use_chat_template --output_dir "./evals" --save_details
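
For context, the task spec custom|math|4|1 follows lighteval's suite|task|num_fewshot|truncation format: the custom suite, the math task, 4 few-shot examples, and a trailing 1 that lets lighteval reduce the number of few-shot examples when the prompt would not fit in the model's context window.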
