---
license: mit
language:
- en
---

## Baby Tokenizer

Compact SentencePiece tokenizer for sample-efficient English language modeling.

### Data

This tokenizer is derived from the BabyLM 100M dataset of mixed-domain data, consisting of the following sources:

- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- Children's books (simple written language)

### Specifications

- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 5
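
The specifications above map naturally onto trainer parameters. The following is a minimal sketch, not the original training script: it assumes the Hugging Face `tokenizers` library and its `SentencePieceBPETokenizer` class, which this card does not name. The helper `train_baby_tokenizer`, the toy corpus, and the `<unk>` special token are hypothetical and used only for illustration.

```python
from tokenizers import SentencePieceBPETokenizer


def train_baby_tokenizer(lines, vocab_size=20_000, limit_alphabet=150, min_frequency=5):
    """Train a SentencePiece-style BPE tokenizer on an iterable of text lines.

    Hypothetical helper: the defaults mirror the specifications listed in this
    card, but the choice of library and model type is an assumption.
    """
    tokenizer = SentencePieceBPETokenizer()
    tokenizer.train_from_iterator(
        lines,
        vocab_size=vocab_size,          # upper bound on the final vocabulary size
        min_frequency=min_frequency,    # a pair must occur at least this often to be merged
        limit_alphabet=limit_alphabet,  # keep at most this many distinct characters
        special_tokens=["<unk>"],       # assumption: the card does not list special tokens
        show_progress=False,
    )
    return tokenizer


# Hypothetical toy corpus standing in for the BabyLM 100M text.
corpus = ["the baby sees the ball", "the dog sees the baby"] * 50
tokenizer = train_baby_tokenizer(corpus)

encoding = tokenizer.encode("the baby sees the dog")
print(encoding.tokens)
```

On a corpus this small the vocabulary stays far below the 20k cap; the cap only becomes the limiting factor on a dataset the size of BabyLM 100M.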