license: mit
language:
  - en

Baby Tokenizer

Compact SentencePiece tokenizer for sample-efficient English language modeling.

Data

This tokenizer is derived from the BabyLM 100M dataset, a mixed-domain corpus consisting of the following sources:

  • CHILDES (child-directed speech)
  • Subtitles (speech)
  • BNC (speech)
  • TED talks (speech)
  • Children's books (simple written language)

Specifications

  • Vocabulary size: 20k
  • Alphabet limit: 150
  • Minimum token frequency: 5
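
To make the last two settings concrete, here is a toy, standard-library-only sketch of what an alphabet limit and a minimum frequency typically mean in subword trainers. It assumes they behave like the `limit_alphabet` and `min_frequency` options of BPE-style trainers (keep only the most frequent characters; only consider symbol pairs seen often enough). This README does not state which trainer or options were actually used, so the helper names (`limited_alphabet`, `frequent_pairs`) and the tiny corpus are purely illustrative, not the training pipeline behind this tokenizer.

```python
from collections import Counter

# Values listed under Specifications.
VOCAB_SIZE = 20_000      # target vocabulary size (not exercised by this toy)
ALPHABET_LIMIT = 150     # maximum number of distinct characters kept
MIN_FREQUENCY = 5        # minimum count for a pair to be considered


def limited_alphabet(lines, limit):
    """Return the `limit` most frequent non-whitespace characters."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    return {ch for ch, _ in counts.most_common(limit)}


def frequent_pairs(lines, min_frequency):
    """Count adjacent character pairs inside words; keep the frequent ones."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts.update(zip(word, word[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_frequency}


# Hypothetical corpus: one common sentence and one rare one.
corpus = ["the baby sees the ball"] * 5 + ["a zebra"]

# A tiny limit of 5 is used here so the cut-off is visible on toy data.
alphabet = limited_alphabet(corpus, limit=5)
pairs = frequent_pairs(corpus, MIN_FREQUENCY)

print(sorted(alphabet))
print(pairs[("t", "h")])      # pair count for "th" across the corpus
print(("z", "e") in pairs)    # rare pair falls below the threshold
```

With the real settings, a rare character would only be dropped once more than 150 distinct characters occur in the corpus, and a pair seen fewer than 5 times would never be merged into a token under these assumptions.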