metadata
library_name: transformers
datasets:
- HuggingFaceTB/smollm-corpus
Doge-tokenizer
Tokenizer for training models on smollm-corpus. It was trained on 2M samples drawn from the following mix:
- FineWeb-Edu 70%
- Cosmopedia v2 20%
- Python-Edu 5%
- FineMath 5%
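
To make the mix concrete, the percentages above can be converted into approximate per-source sample counts. This is only an illustrative sketch: it assumes "2M" means exactly 2,000,000 samples, and the names `TOTAL_SAMPLES`, `MIXTURE_PERCENT`, and `samples_per_source` are hypothetical helpers, not part of this repository or of any library.

```python
# Assumption: "2M samples" is taken as exactly 2,000,000.
TOTAL_SAMPLES = 2_000_000

# Mixture weights as listed above, in percent of the tokenizer training samples.
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 20,
    "Python-Edu": 5,
    "FineMath": 5,
}


def samples_per_source(total, mixture_percent):
    """Return how many samples each source contributes to the mix."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIXTURE_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under that assumption, FineWeb-Edu would contribute about 1.4M samples, Cosmopedia v2 about 400K, and Python-Edu and FineMath about 100K each.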