A collection of datasets for LLM pretraining

Hugging Face TB Research
AI & ML interests
Exploring smol models and high-quality web datasets and LLM-generated synthetic datasets (TB stands for Textbook, inspired by the "Textbooks Are All You Need" paper)
Organization Card
HuggingFaceTB
This is the home for smol models (SmolLM & SmolVLM) and high-quality pre-training datasets. We released:
- FineWeb-Edu: a version of the FineWeb dataset filtered for educational content, paper available here.
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and 30M samples. It contains synthetic textbooks, blog posts, and stories generated by Mixtral. Blog post available here.
- SmolLM-Corpus: the pre-training corpus of SmolLM, made up of Cosmopedia v0.2, FineWeb-Edu dedup and Python-Edu. Blog post available here.
- SmolLM2 models: a series of strong small models in three sizes: 135M, 360M and 1.7B.
- SmolVLM2: a family of small video and vision models in three sizes: 2.2B, 500M and 256M. Blog post available here.
- FineMath: the best public math pretraining dataset, with 50B tokens of mathematical and problem-solving data.
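As a rough illustration of how one of the released SmolLM2 models might be tried out, here is a minimal sketch using the `transformers` text-generation pipeline. The repo naming pattern is an assumption extrapolated from the `HuggingFaceTB/SmolLM2-1.7B-Instruct` entry on this page; `smollm2_repo_id` and `generate` are hypothetical helper names, not part of any library.

```python
# Sizes named in the release list above.
SMOLLM2_SIZES = ("135M", "360M", "1.7B")


def smollm2_repo_id(size: str, instruct: bool = True) -> str:
    """Build a Hub repo id for a SmolLM2 model.

    Assumption: every size follows the pattern of the listed
    HuggingFaceTB/SmolLM2-1.7B-Instruct repo.
    """
    if size not in SMOLLM2_SIZES:
        raise ValueError(f"unknown SmolLM2 size: {size!r}")
    suffix = "-Instruct" if instruct else ""
    return f"HuggingFaceTB/SmolLM2-{size}{suffix}"


def generate(prompt: str, size: str = "135M", max_new_tokens: int = 64) -> str:
    """Generate a continuation of `prompt` with the chosen model size."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=smollm2_repo_id(size))
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]


# Example call (downloads the model weights on first use):
# print(generate("Gravity is"))
```

The smallest size is used as the default so that a first run stays light; pass `size="1.7B"` for the largest model.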
News 🗞️
- FineMath: the best public math pretraining dataset, with 50B tokens of mathematical and problem-solving data: https://huggingface.co/datasets/HuggingFaceTB/finemath
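To look at a few FineMath rows without downloading the whole dataset, something like the following sketch could work, using the `datasets` library in streaming mode. The config name `finemath-4plus` is an assumption to check against the dataset card, and `dataset_url` and `peek` are hypothetical helper names.

```python
from itertools import islice

FINEMATH_REPO = "HuggingFaceTB/finemath"  # repo id taken from the URL above


def dataset_url(repo_id: str) -> str:
    """Return the Hub page URL for a dataset repo id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def peek(repo_id: str = FINEMATH_REPO, config: str = "finemath-4plus", n: int = 3):
    """Return the first `n` rows of the train split as a list of dicts.

    Assumption: `config` names a valid subset of the dataset.
    """
    # Imported lazily so dataset_url works without the datasets library.
    from datasets import load_dataset

    # streaming=True iterates over rows instead of downloading every file.
    stream = load_dataset(repo_id, config, split="train", streaming=True)
    return list(islice(stream, n))


# Example call (requires network access):
# for row in peek(n=2):
#     print(sorted(row.keys()))
```

Streaming is chosen here because the dataset is described as holding 50B tokens, which makes a full download impractical for a quick inspection.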

Collections (13)

Spaces (12)
- SmolVLM2 XSPFGenerator (VLC prototype): Generate video highlights and playlist (Sleeping • 20)
- SmolVLM2 IPhone Waitlist: Sign in to receive news on the iPhone app (Running • 17)
- SmolVLM2 HighlightGenerator: Generate video highlights from uploaded video (Running on A100 • 52)
- SmolVLM: Generate text by analyzing images and videos (Running on Zero • 53)
- SmolVLM 256M Instruct WebGPU: Generate descriptions for images using WebGPU technology (Running • 40)
- SmolVLM 500M Instruct WebGPU (Running • 31)

Models (73)
- HuggingFaceTB/SmolLM2-1.7B-Instruct • Text Generation • 370k • 573
- HuggingFaceTB/SmolVLM-Instruct • Image-Text-to-Text • 69.3k • 407
- HuggingFaceTB/SmolVLM-500M-Instruct • Image-Text-to-Text • 26.7k • 111
- HuggingFaceTB/SmolVLM-256M-Instruct • Image-Text-to-Text • 36.8k • 166
- HuggingFaceTB/SmolVLM2-256M-Video-Instruct • Image-Text-to-Text • 4.68k • 39
- HuggingFaceTB/SmolVLM2-500M-Video-Instruct • Image-Text-to-Text • 6.24k • 40
- HuggingFaceTB/SmolVLM2-2.2B-Instruct • Image-Text-to-Text • 430k • 109
- HuggingFaceTB/SmolLM2-360M-intermediate-checkpoints • 90
- HuggingFaceTB/SmolLM2-1.7B-intermediate-checkpoints • 794
- HuggingFaceTB/SmolLM2-135M-intermediate-checkpoints • 63

Datasets (37)
- HuggingFaceTB/dclm-edu • 1B • 3.19k • 18
- HuggingFaceTB/SmolLM2-intermediate-evals • 582 • 70
- HuggingFaceTB/smoltalk • 2.2M • 8.32k • 313
- HuggingFaceTB/smol-smoltalk • 485k • 718 • 32
- HuggingFaceTB/finemath • 48.3M • 11.1k • 292
- HuggingFaceTB/everyday-conversations-llama3.1-2k • 2.38k • 658 • 98
- HuggingFaceTB/MagPie-Pro-300k-MT • 300k • 140
- HuggingFaceTB/finemath_contamination_report • 5.33k • 108 • 1
- HuggingFaceTB/math_tasks • 21.3k • 274 • 1
- HuggingFaceTB/MATH • 160 • 4