10 Free Comprehensive Datasets for Supervised Fine-Tuning
The quality, size, and relevance of a dataset directly impact the effectiveness of fine-tuning and how well the resulting model performs in real-world applications. With so many datasets available for different tasks, it can be challenging to choose the one that best suits your purposes.
So today, we invite you to explore the top 10 free datasets for natural language processing and math (short loading sketches follow each list below):
1. fka/awesome-chatgpt-prompts offers a large variety of prompts that can be used with ChatGPT. Over 700 models have been trained on this dataset.
2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation.
3. HuggingFaceFW/fineweb-2 is another version of FineWeb that provides high-quality pretraining data in over 1,000 languages.
4. O1-OPEN/OpenO1-SFT contains Chinese and English data and can be used for Chain-of-Thought activation.
5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.
6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and supports diverse use cases, such as content moderation, safety benchmarking, and training instruction-following models.
7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
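To preview any of these corpora without downloading them in full, the Hugging Face datasets library supports streaming. Here is a minimal sketch using FineWeb; the "sample-10BT" config name and the "text" field are assumptions based on the dataset card, so adjust them if the card lists something different.

```python
from datasets import load_dataset

# Stream FineWeb instead of downloading the full 15T-token corpus to disk.
# "sample-10BT" is assumed to be a small sample config listed on the dataset card;
# drop the `name` argument to fall back to the default config.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Inspect a few records to check the text quality and the available fields.
for i, example in enumerate(fineweb):
    print(example["text"][:200])
    if i >= 2:
        break
```

The same pattern works for the other datasets in the list; only the repository ID, config name, and field names change.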
Math datasets:
1. HuggingFaceTB/finemath consists of educational math content and comes in two versions, with 34B and 54B tokens (see the loading sketch below).
From the dataset's release announcement:
Introducing FineMath: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.
We built the dataset by carefully extracting math data from Common Crawl, then iteratively filtering and recalling high-quality math pages with a classifier trained on synthetic annotations to identify mathematical reasoning and deduction.
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and over continued pre-training on other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! We're also releasing all the ablation models as well as the evaluation code.
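As a rough sketch of how FineMath can be pulled in for continued pre-training experiments, the dataset can also be streamed with the datasets library. The config name "finemath-3plus" for the 34B-token version is an assumption based on the dataset card, so adjust it if the card lists different subsets.

```python
from datasets import load_dataset

# Stream the FineMath corpus; "finemath-3plus" is assumed to be the config name
# of the 34B-token subset -- check the dataset card if the name differs.
finemath = load_dataset(
    "HuggingFaceTB/finemath",
    name="finemath-3plus",
    split="train",
    streaming=True,
)

# Peek at one document to see the kind of math content it contains.
sample = next(iter(finemath))
print(sample["text"][:300])
```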