license: mit
language:
- en
## Baby Tokenizer
Compact SentencePiece tokenizer for sample-efficient English language modeling.
### Data
This tokenizer is derived from the BabyLM 100M dataset, a mixed-domain corpus consisting of the following sources:
- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- Children's books (simple written language)
### Specifications
- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 5
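
To make the three settings above more concrete, here is a minimal, standard-library-only sketch of how a BPE-style trainer might apply them. It is not the training script used for this tokenizer: the function `train_toy_bpe`, its parameter names (`vocab_size`, `limit_alphabet`, `min_frequency`), the `<unk>` symbol, and the toy corpus are all assumptions made for illustration, and real trainers differ in many details (pre-tokenization, whitespace markers, tie-breaking).

```python
from collections import Counter


def train_toy_bpe(words, vocab_size, limit_alphabet, min_frequency):
    """Toy BPE trainer (illustrative only) showing the three settings."""
    word_freqs = Counter(words)

    # Alphabet limit: keep only the most frequent characters as base symbols.
    char_freqs = Counter()
    for word, freq in word_freqs.items():
        for ch in word:
            char_freqs[ch] += freq
    alphabet = [ch for ch, _ in char_freqs.most_common(limit_alphabet)]
    keep = set(alphabet)

    # Each word becomes a tuple of symbols; characters outside the
    # alphabet are replaced by a placeholder and never merged.
    splits = {
        word: tuple(ch if ch in keep else "<unk>" for ch in word)
        for word in word_freqs
    }

    # Vocabulary size: keep merging until this many tokens exist.
    vocab = ["<unk>"] + alphabet
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freqs = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                if "<unk>" not in pair:
                    pair_freqs[pair] += word_freqs[word]
        if not pair_freqs:
            break

        # Pick the most frequent pair (ties broken by the pair itself).
        best, freq = max(pair_freqs.items(), key=lambda kv: (kv[1], kv[0]))

        # Minimum token frequency: stop once no pair is frequent enough.
        if freq < min_frequency:
            break

        merged = best[0] + best[1]
        if merged not in vocab:
            vocab.append(merged)

        # Apply the merge to every word.
        new_splits = {}
        for word, symbols in splits.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_splits[word] = tuple(out)
        splits = new_splits
    return vocab


# Tiny made-up corpus; the real tokenizer uses 20k / 150 / 5 on BabyLM 100M.
words = ("low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3).split()
vocab = train_toy_bpe(words, vocab_size=20, limit_alphabet=150, min_frequency=5)
print(vocab)
```

In this sketch a small alphabet limit pushes rare characters to the placeholder symbol, and the minimum frequency stops merges that would create tokens seen only a handful of times, which is one way a vocabulary stays compact on a small corpus.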