
# ULM-32k SlimPajama-3M

A ULM (Unigram LM) tokeniser with a vocabulary size of 32768, trained on the first 3 million examples of SlimPajama-627B.
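
For reference, a minimal sketch of how the first 3 million examples can be streamed, assuming the `datasets` library, the `cerebras/SlimPajama-627B` repository on the Hugging Face Hub, and its `text` field:

```python
from itertools import islice
from datasets import load_dataset

# Stream the corpus so the full 627B-token dataset never has to be downloaded.
corpus = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

# The first 3 million examples, as used for training this tokeniser.
first_3m = (example["text"] for example in islice(corpus, 3_000_000))
```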

## Tokeniser details

ULM trainer implementation:

Preprocessor:

- During training: TkTkT's `SentencePiecePreprocessor`
- During inference: TkTkT's `ModernEnglishPreprocessor` (approximated in the sketch below):
  1. NFKC normalisation
  2. Punctuation splitter, whitespace splitter, English contraction splitter
  3. GPT-2's pseudo-byte mapping
  4. Start-of-word marker Ġ
  5. Digit and hyphen isolation
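
A crude plain-Python approximation of these inference-time steps. The actual `ModernEnglishPreprocessor` lives in TkTkT; the regex and byte table below are simplified stand-ins, and the English contraction splitter is not reproduced:

```python
import unicodedata
import regex  # third-party "regex" module, for \p{...} Unicode classes

def to_pseudo_bytes(piece: str) -> str:
    # Simplified stand-in for GPT-2's pseudo-byte mapping: printable ASCII
    # passes through; every other byte is mapped to codepoint 0x100 + byte.
    return "".join(chr(b) if 0x21 <= b <= 0x7E else chr(0x100 + b)
                   for b in piece.encode("utf-8"))

def preprocess(text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", text)            # 1. NFKC normalisation
    pretokens = []
    for word in text.split():                             # 2. whitespace splitting
        # 2./5. isolate punctuation, digits and hyphens from letter runs
        pieces = regex.findall(r"\p{L}+|\p{N}|-|[^\s\p{L}\p{N}]", word)
        for i, piece in enumerate(pieces):
            piece = to_pseudo_bytes(piece)                 # 3. pseudo-byte mapping
            if i == 0:
                piece = "Ġ" + piece                        # 4. start-of-word marker
            pretokens.append(piece)
    return pretokens

print(preprocess("Tokenisers aren't 100% magic."))
```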

## Training details

Time: 3h40m

- Preprocessing and counting the 3M corpus: 2h45m
- ULM algorithm: 55m (illustrated in the sketch below)
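
As an illustration only, and not necessarily the trainer used here: a 32768-type Unigram/ULM model can be fit with the standard `sentencepiece` trainer along these lines. The input file name is hypothetical:

```python
import sentencepiece as spm

# Fit a Unigram (ULM) model with a 32k vocabulary on a plain-text file
# containing one example per line.
spm.SentencePieceTrainer.train(
    input="slimpajama_first3m.txt",   # hypothetical pre-extracted corpus file
    model_prefix="ulm32k",
    model_type="unigram",
    vocab_size=32768,
    max_sentence_length=8192,         # roughly mirrors the 8192-character cutoff under "Data sizes"
)
```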

Memory: 257 GiB peak usage (i.e. about 80 GiB RAM per million sentences).

Data sizes:

- Examples considered: 3 000 000
- Examples used: 2 609 893 (390 107 examples dropped for being longer than 8192 characters)
- Characters counted: 6 685 212 190
- Unique words after whitespace splitting: 9 254 839 (see the counting sketch below)
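
A minimal sketch of how such counts can be reproduced for any iterable of example strings (in practice, the streamed 3M SlimPajama examples from the sketch at the top). The 8192-character cutoff follows the description above; the function and field names are assumptions:

```python
from collections import Counter
from typing import Iterable

MAX_CHARS = 8192  # examples longer than this were dropped

def corpus_statistics(examples: Iterable[str]) -> dict:
    used = dropped = chars = 0
    words = Counter()
    for text in examples:
        if len(text) > MAX_CHARS:
            dropped += 1
            continue
        used += 1
        chars += len(text)
        words.update(text.split())  # whitespace splitting only
    return {
        "examples_used": used,
        "examples_dropped": dropped,
        "characters_counted": chars,
        "unique_words": len(words),
    }

# Tiny demo; in practice, pass the 3M streamed SlimPajama examples.
print(corpus_statistics(["Hello world", "Hello again", "x" * 10_000]))
```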