t5-base-dutch

Created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project "Pre-train T5 from scratch in Dutch".

See also the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, both of which are based on this model.

5 Jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.

11 Jan 2022: See also yhavinga/t5-v1.1-base-dutch-cased, which reaches an evaluation accuracy of 0.78.

Model

  • Configuration based on google/t5-base
  • 12 layers, 12 heads
  • Dropout set to 0.1

Dataset

This model was trained on the full configuration of cleaned Dutch mC4, which is the original mC4 with the following filters applied:

  • Documents containing words from a selection of the Dutch and English List of Dirty Naughty Obscene and Otherwise Bad Words are removed
  • Sentences with fewer than 3 words are removed
  • Sentences containing a word of more than 1000 characters are removed
  • Documents with fewer than 5 sentences are removed
  • Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
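
The filters above can be sketched in a few lines of Python. This is a minimal illustration, not the cleaning script that was actually used: the sentence splitter is a naive regex, `BAD_WORDS` is a hypothetical placeholder for the real word lists, and both the order of the checks and the choice to count sentences after sentence-level filtering are assumptions.

```python
import re

# Phrases that cause a whole document to be dropped (copied from the list above).
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)

# Hypothetical placeholder: the real filter uses a selection of the
# Dutch and English bad-word lists.
BAD_WORDS = frozenset({"badword"})

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOCUMENT = 5


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_document(text):
    """Return the cleaned text, or None if the document should be dropped."""
    lowered = text.lower()

    # Document-level filters: banned phrases and bad words.
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    if set(re.findall(r"\w+", lowered)) & BAD_WORDS:
        return None

    # Sentence-level filters: too few words, or an overly long word.
    kept = []
    for sentence in split_sentences(text):
        tokens = sentence.split()
        if len(tokens) < MIN_WORDS_PER_SENTENCE:
            continue
        if any(len(token) > MAX_WORD_LENGTH for token in tokens):
            continue
        kept.append(sentence)

    # Assumption: the sentence count is checked after sentence filtering.
    if len(kept) < MIN_SENTENCES_PER_DOCUMENT:
        return None
    return " ".join(kept)
```

Calling `clean_document` on a document returns either the surviving sentences joined by spaces or `None`, so it can be used directly as a map-then-filter step over a corpus.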

Tokenization

A SentencePiece tokenizer was trained from scratch on this dataset. The full configuration contains 34B tokens in total.

Training

The model was trained for 1 epoch on the full mc4_nl_cleaned dataset configuration (34B tokens), which amounts to 528 482 steps at a batch size of 128. Training took 57 hours. A triangular learning rate schedule was used, with a peak learning rate of 0.005.
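
As a rough sanity check on these figures, the sketch below derives the number of sequences seen, the average tokens per sequence, and the training throughput, and shows one common way to write a triangular schedule (linear warmup to the peak, then linear decay to zero). The warmup fraction is not stated in this card, so `WARMUP_FRACTION` is purely an assumption for illustration, as is the decay to exactly zero.

```python
TOTAL_STEPS = 528_482    # from the text
BATCH_SIZE = 128         # from the text
TOTAL_TOKENS = 34e9      # from the text (rounded figure)
TRAINING_HOURS = 57      # from the text
PEAK_LR = 0.005          # from the text
WARMUP_FRACTION = 0.1    # ASSUMPTION: not stated in the model card


def triangular_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
                  warmup_fraction=WARMUP_FRACTION):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Derived quantities, using only the numbers quoted above.
sequences_seen = TOTAL_STEPS * BATCH_SIZE             # 67,645,696 sequences
tokens_per_sequence = TOTAL_TOKENS / sequences_seen   # roughly 500 tokens
steps_per_second = TOTAL_STEPS / (TRAINING_HOURS * 3600)

print(f"sequences seen:      {sequences_seen:,}")
print(f"tokens per sequence: {tokens_per_sequence:.1f}")
print(f"steps per second:    {steps_per_second:.2f}")
print(f"lr at midpoint:      {triangular_lr(TOTAL_STEPS // 2):.5f}")
```

Because the 34B token count is rounded, the tokens-per-sequence figure is only an approximate average.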

Evaluation

  • Loss: 1.38
  • Accuracy: 0.70
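
For readers who prefer perplexity, the loss can be converted as below. This assumes the reported loss is a mean per-token cross-entropy measured in nats, which the card does not state explicitly.

```python
import math

eval_loss = 1.38  # from the text

# ASSUMPTION: the loss is a mean per-token cross-entropy in nats,
# so perplexity is simply exp(loss).
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.2f}")
```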