# ULM-32k SlimPajama-3M
ULM tokeniser with vocabulary size 32768, trained on the first 3 million examples in SlimPajama-627B.
## Tokeniser details
ULM trainer implementation (an illustrative back-end call is sketched below):
- Back-end: SentencePiece's `SentencePieceTrainer`.
- Front-end: TkTkT's `KudoPieceTrainer`.
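As a minimal sketch of what the back-end call amounts to, the following trains a 32k-vocabulary unigram (ULM) model directly with SentencePiece's Python API. The corpus path and the auxiliary options are assumptions for illustration, not the exact arguments passed by TkTkT's `KudoPieceTrainer`.

```python
import sentencepiece as spm

# Illustrative back-end call: train a unigram (ULM) model with SentencePiece.
# TkTkT's KudoPieceTrainer wraps this with its own preprocessing and defaults.
spm.SentencePieceTrainer.train(
    input="slimpajama_3M_preprocessed.txt",  # hypothetical pre-processed corpus file
    model_prefix="ulm32k_slimpajama3m",
    model_type="unigram",                    # the ULM algorithm (Kudo, 2018)
    vocab_size=32768,
    character_coverage=1.0,                  # sensible with a pseudo-byte alphabet
    max_sentence_length=8192,                # length cap, cf. the 8192-character example cutoff (assumption)
)
```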
Preprocessor:
- During training: TkTkT's `SentencePiecePreprocessor`
- During inference: TkTkT's `ModernEnglishPreprocessor`, which applies (sketched after this list):
  - NFKC normalisation
  - Punctuation splitter, whitespace splitter, English contraction splitter
  - GPT-2's pseudo-byte mapping
  - Start-of-word marker `Ġ`
  - Digit and hyphen isolation
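The actual preprocessor lives in TkTkT; the following is only a rough, self-contained approximation of the listed steps in plain Python. The splitting regex and the placement of the `Ġ` marker are simplifying assumptions, not TkTkT's implementation.

```python
import unicodedata
import regex as re  # the third-party `regex` package supports the \p{...} classes below

def gpt2_pseudo_bytes() -> dict[int, str]:
    """GPT-2's mapping from raw bytes to printable pseudo-byte characters."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_MAP = gpt2_pseudo_bytes()

# Rough splitter: English contractions, runs of letters, single digits,
# hyphens, then any other punctuation. Digits and hyphens come out as
# isolated pretokens, as in the list above.
SPLITTER = re.compile(r"'(?:s|t|re|ve|m|ll|d)|\p{L}+|\p{N}|-|[^\s\p{L}\p{N}]")

def preprocess(text: str) -> list[str]:
    """Approximate inference-time preprocessing: NFKC normalisation,
    whitespace/punctuation/contraction/digit/hyphen splitting, pseudo-byte
    mapping, and a start-of-word marker on the first piece of each word."""
    text = unicodedata.normalize("NFKC", text)
    pretokens = []
    for word in text.split():                      # whitespace splitter
        for i, piece in enumerate(SPLITTER.findall(word)):
            mapped = "".join(BYTE_MAP[b] for b in piece.encode("utf-8"))
            pretokens.append(("Ġ" if i == 0 else "") + mapped)
    return pretokens

print(preprocess("Tokenisers aren't built in 1-2 days."))
```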
## Training details
Time: 3h40m
- Preprocessing and counting the 3M corpus: 2h45m
- ULM algorithm: 55m
Memory: 257 GiB peak usage (roughly 85 GiB of RAM per million examples).
Data sizes (a reproduction sketch follows the list):
- Examples considered: 3 000 000
- Examples used: 2 609 893 (390 107 examples were dropped for being longer than 8192 characters).
- Characters counted: 6 685 212 190
- Unique words after whitespace splitting: 9 254 839
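The corpus selection described above (the first 3 million SlimPajama examples, minus those longer than 8192 characters) could be reproduced along the following lines. The dataset identifier, the streaming approach and the `"text"` field name are assumptions, not the exact loading code that was used.

```python
from itertools import islice
from datasets import load_dataset

# Stream the first 3M examples of SlimPajama-627B and keep only those of at
# most 8192 characters, mirroring the counts reported above.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

kept, dropped, n_chars, words = 0, 0, 0, set()
for example in islice(stream, 3_000_000):
    text = example["text"]
    if len(text) > 8192:
        dropped += 1
        continue
    kept += 1
    n_chars += len(text)
    words.update(text.split())   # unique words after whitespace splitting

print(f"used {kept}, dropped {dropped}, "
      f"{n_chars} characters, {len(words)} unique words")
```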