Introducing FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath
Math remains challenging for LLMs; training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.
We build the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
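The filtering step above can be sketched as follows. This is an illustrative stand-in, not the FineMath pipeline itself: the actual classifier is trained on synthetic (LLM-generated) quality annotations of real web pages, whereas here a tiny TF-IDF + logistic-regression model on made-up toy pages plays that role. All page texts, labels, and the threshold are hypothetical.

```python
# Sketch of classifier-based quality filtering for math web pages.
# Assumption: toy data and a simple TF-IDF + logistic-regression model
# stand in for a classifier trained on synthetic annotations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy "annotated" pages: 1 = contains math reasoning, 0 = not math.
train_pages = [
    "Solve for x: 2x + 3 = 7, so 2x = 4 and x = 2.",
    "Prove that the sum of two even integers is even.",
    "Top 10 celebrity diets you need to try this summer.",
    "Breaking news: local sports team wins the championship.",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(train_pages), train_labels)

def filter_pages(pages, threshold=0.5):
    """Keep only pages the classifier scores as math with prob >= threshold."""
    scores = clf.predict_proba(vectorizer.transform(pages))[:, 1]
    return [page for page, score in zip(pages, scores) if score >= threshold]

candidates = [
    "Let f be a polynomial. Then its derivative follows the power rule.",
    "Celebrity gossip and red carpet photos from the awards show.",
]
kept = filter_pages(candidates)
```

In a recall-oriented pipeline like the one described, this scoring pass would be run iteratively: pages kept at a looser threshold seed further crawling and re-annotation, and the classifier is retrained before the next round.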
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning!
We're also releasing all the ablation models as well as the evaluation code.
HuggingFaceTB/finemath-6763fb8f71b6439b653482c2