
ULM-32k SlimPajama-3M

ULM tokeniser with vocabulary size 32768, trained on the first 3 million examples in SlimPajama-627B.

Tokeniser details

ULM trainer implementation:

Preprocessor:

  • During training: TkTkT's SentencePiecePreprocessor
  • During inference: TkTkT's ModernEnglishPreprocessor, which applies the following steps (see the sketch after this list):
    1. NFKC normalisation
    2. Punctuation splitter, whitespace splitter, English contraction splitter
    3. GPT-2's pseudo-byte mapping
    4. Start-of-word marker Ġ
    5. Digit and hyphen isolation
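
For illustration, below is a minimal, self-contained sketch of these inference-time steps in plain Python. It is not TkTkT's implementation: the `PIECE_PATTERN` regex standing in for the punctuation, contraction, digit and hyphen splitters, and the `pretokenise` helper itself, are assumptions made for this example; only the pseudo-byte mapping follows GPT-2's published definition.

```python
import re
import unicodedata

def gpt2_byte_map() -> dict[int, str]:
    """GPT-2's mapping from raw bytes to printable 'pseudo-byte' characters."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

BYTE_MAP = gpt2_byte_map()

# Rough stand-in (assumption) for TkTkT's punctuation, contraction, digit and hyphen splitters.
PIECE_PATTERN = re.compile(r"'(?:s|t|re|ve|m|ll|d)\b|\d|-|[^\W\d_]+|[^\w\s]", re.IGNORECASE)

def pretokenise(text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", text)                     # 1. NFKC normalisation
    pretokens = []
    for word in text.split():                                      # 2. whitespace splitting
        for i, piece in enumerate(PIECE_PATTERN.findall(word)):    # 2./5. punctuation, contractions, digits, hyphens
            mapped = "".join(BYTE_MAP[b] for b in piece.encode("utf-8"))  # 3. GPT-2 pseudo-byte mapping
            if i == 0:
                mapped = "\u0120" + mapped                         # 4. start-of-word marker Ġ
            pretokens.append(mapped)
    return pretokens

print(pretokenise("The dog's 3-year-old toys weren't cheap."))
# ['ĠThe', 'Ġdog', "'s", 'Ġ3', '-', 'year', '-', 'old', 'Ġtoys', 'Ġweren', "'t", 'Ġcheap', '.']
```

Note that Ġ (U+0120) is exactly the character GPT-2's pseudo-byte mapping assigns to the space byte, which is why it doubles as a start-of-word marker.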

Training details

Time: 3h40m

  • Preprocessing and counting the 3M corpus: 2h45m
  • ULM algorithm: 55m

Memory: 257 GiB peak usage (i.e. about 80 GiB RAM per million sentences).

Data sizes (a corpus-preparation sketch follows the list):

  • Examples considered: 3 000 000
  • Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters).
  • Characters counted: 6 685 212 190
  • Unique words after whitespace splitting: 9 254 839
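
The following is a hedged sketch, under stated assumptions, of how a corpus of this shape could be prepared and a comparable 32768-token unigram (ULM) vocabulary trained with the sentencepiece library. It is not the pipeline used for this tokeniser (which preprocesses with TkTkT's SentencePiecePreprocessor); the file names and all parameters other than the vocabulary size, model type, 3M example budget and 8192-character cut-off are illustrative.

```python
import sentencepiece as spm
from datasets import load_dataset

# Stream the first 3M examples of SlimPajama-627B and drop over-long ones,
# mirroring the counts reported above (illustrative file name).
dataset = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
with open("slimpajama_3M.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        if i >= 3_000_000:
            break
        text = example["text"]
        if len(text) > 8192:          # examples longer than 8192 characters are dropped
            continue
        f.write(text.replace("\n", " ") + "\n")

# Train a 32768-token unigram (ULM) vocabulary on the prepared corpus.
# Parameter values other than vocab_size and model_type are assumptions.
spm.SentencePieceTrainer.train(
    input="slimpajama_3M.txt",
    model_prefix="ulm32k_slimpajama3M",
    model_type="unigram",
    vocab_size=32768,
    character_coverage=1.0,
    max_sentence_length=8192,
)
```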