
ULM-32k SlimPajama-3M

ULM tokeniser with vocabulary size 32768, trained on the first 3 million examples in SlimPajama-627B.

Tokeniser details

ULM trainer implementation:

Preprocessor:

  • During training: TkTkT's SentencePiecePreprocessor
  • During inference: TkTkT's ModernEnglishPreprocessor, which applies the following steps (see the sketch after this list):
    1. NFKC normalisation
    2. Punctuation splitter, whitespace splitter, English contraction splitter
    3. GPT-2's pseudo-byte mapping
    4. Start-of-word marker Ġ
    5. Digit and hyphen isolation
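
For illustration, below is a minimal, self-contained sketch of these inference-time steps in plain Python. It is not TkTkT's implementation: the `PIECE_PATTERN` regex standing in for the punctuation, contraction, digit and hyphen splitters, and the `pretokenise` helper itself, are assumptions made for this example; only the pseudo-byte mapping follows GPT-2's published definition.

```python
import re
import unicodedata

def gpt2_byte_map() -> dict[int, str]:
    """GPT-2's mapping from raw bytes to printable 'pseudo-byte' characters."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

BYTE_MAP = gpt2_byte_map()

# Rough stand-in (assumption) for TkTkT's punctuation, contraction, digit and hyphen splitters.
PIECE_PATTERN = re.compile(r"'(?:s|t|re|ve|m|ll|d)\b|\d|-|[^\W\d_]+|[^\w\s]", re.IGNORECASE)

def pretokenise(text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", text)                     # 1. NFKC normalisation
    pretokens = []
    for word in text.split():                                      # 2. whitespace splitting
        for i, piece in enumerate(PIECE_PATTERN.findall(word)):    # 2./5. punctuation, contractions, digits, hyphens
            mapped = "".join(BYTE_MAP[b] for b in piece.encode("utf-8"))  # 3. GPT-2 pseudo-byte mapping
            if i == 0:
                mapped = "\u0120" + mapped                         # 4. start-of-word marker Ġ
            pretokens.append(mapped)
    return pretokens

print(pretokenise("The dog's 3-year-old toys weren't cheap."))
# ['ĠThe', 'Ġdog', "'s", 'Ġ3', '-', 'year', '-', 'old', 'Ġtoys', 'Ġweren', "'t", 'Ġcheap', '.']
```

Note that Ġ (U+0120) is exactly the character GPT-2's pseudo-byte mapping assigns to the space byte, which is why it doubles as a start-of-word marker.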

Training details

Time: 3h40m

  • Preprocessing and counting the 3M corpus: 2h45m
  • ULM algorithm: 55m

Memory: 257 GiB peak usage (i.e. about 80 GiB RAM per million sentences).

Data sizes (a corpus-preparation sketch follows the list):

  • Examples considered: 3 000 000
  • Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters).
  • Characters counted: 6 685 212 190
  • Unique words after whitespace splitting: 9 254 839
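
The following is a hedged sketch, under stated assumptions, of how a corpus of this shape could be prepared and a comparable 32768-token unigram (ULM) vocabulary trained with the sentencepiece library. It is not the pipeline used for this tokeniser (which preprocesses with TkTkT's SentencePiecePreprocessor); the file names and all parameters other than the vocabulary size, model type, 3M example budget and 8192-character cut-off are illustrative.

```python
import sentencepiece as spm
from datasets import load_dataset

# Stream the first 3M examples of SlimPajama-627B and drop over-long ones,
# mirroring the counts reported above (illustrative file name).
dataset = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
with open("slimpajama_3M.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        if i >= 3_000_000:
            break
        text = example["text"]
        if len(text) > 8192:          # examples longer than 8192 characters are dropped
            continue
        f.write(text.replace("\n", " ") + "\n")

# Train a 32768-token unigram (ULM) vocabulary on the prepared corpus.
# Parameter values other than vocab_size and model_type are assumptions.
spm.SentencePieceTrainer.train(
    input="slimpajama_3M.txt",
    model_prefix="ulm32k_slimpajama3M",
    model_type="unigram",
    vocab_size=32768,
    character_coverage=1.0,
    max_sentence_length=8192,
)
```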