---
license: mit
language:
- en
---
## Baby Tokenizer
Compact SentencePiece tokenizer for sample-efficient English language modeling.
### Data
This tokenizer was trained on the BabyLM 100M dataset, a mixed-domain corpus drawn from the following sources:
- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- Children's books (simple written language)
### Specifications
- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 5
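
The specifications above resemble the trainer options of the Hugging Face `tokenizers` library (`vocab_size`, `limit_alphabet`, `min_frequency`). The sketch below shows how a tokenizer with these settings might be trained, assuming that library's `SentencePieceBPETokenizer` was used. This README does not state the actual training script, so the function name `train_baby_tokenizer` and the toy corpus are hypothetical stand-ins.

```python
from tokenizers import SentencePieceBPETokenizer


def train_baby_tokenizer(lines):
    """Train a SentencePiece-style BPE tokenizer with the listed specifications.

    `lines` is any iterable of text lines; the real BabyLM 100M data would go here.
    """
    tokenizer = SentencePieceBPETokenizer()
    tokenizer.train_from_iterator(
        lines,
        vocab_size=20_000,      # Vocabulary size: 20k (an upper bound)
        limit_alphabet=150,     # Alphabet limit: keep at most 150 distinct characters
        min_frequency=5,        # Minimum token frequency: 5
        special_tokens=["<unk>"],
        show_progress=False,
    )
    return tokenizer


# Toy corpus (hypothetical), repeated so that some pairs reach the frequency threshold.
toy_corpus = ["the dog saw the ball", "the baby saw the dog"] * 10
tokenizer = train_baby_tokenizer(toy_corpus)
encoding = tokenizer.encode("the dog saw the baby")
print(encoding.tokens)
```

On a corpus this small the resulting vocabulary is far below 20k, since `vocab_size` is only an upper bound and pairs seen fewer than five times are never merged.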