HuggingFaceFW/fineweb
25B rows, 317k downloads, 2.02k likes
A collection of datasets for LLM pretraining
Note 🍷 Web datasets
Note Highly curated web datasets filtered using classifiers
Note Highly curated math pages from CommonCrawl
Note 💻 GitHub code dataset
Note Synthetic textbooks
Note Contains Cosmopedia v2 (synthetic textbooks) and Python-Edu (educational Python code)