GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
Datasets:
- mC4 NL Cleaned, dataset config: full (33B tokens)
- A recreation of the Toronto BookCorpus (TBC) for the Dutch language (see e.g. https://github.com/sgraaf/Replicate-Toronto-BookCorpus)
Tokenizer:
- Trained on mC4 with scripts from the Hugging Face Transformers Flax examples
Training details:
- Trained for 320k steps (30 Dec 2021)
- Block size: 512
- Optimizer: Adam, learning rate 8e-4, beta1 0.9, beta2 0.98
- Warmup steps: 5000
- Weight decay: 0.01
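To make the warmup and learning-rate settings above concrete, here is a minimal sketch of a schedule in plain Python. The card only gives the peak learning rate, the warmup length, and the step count. The shape used here (linear warmup followed by linear decay to zero, ending at 320k steps) is an assumption, and the names `PEAK_LR`, `WARMUP_STEPS`, `TOTAL_STEPS`, and `learning_rate` are hypothetical, not taken from the training scripts.

```python
# Values from the training details above.
PEAK_LR = 8e-4
WARMUP_STEPS = 5_000
# Assumption: the schedule ends at the reported 320k steps.
TOTAL_STEPS = 320_000


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        # Ramp up from 0 to the peak over the warmup steps.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly so the rate reaches 0 at TOTAL_STEPS and stays there.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    for step in (0, 2_500, 5_000, 162_500, 320_000):
        print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at step 5000 and falls back to half the peak at step 162,500, the midpoint of the decay phase.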
Further fine-tuned on a Dutch book corpus.
Work in progress, Dec 2021 to Jan 2022.
- Many thanks to the Google TPU Research Cloud for providing access to a TPU cluster!
- Thanks to @gsarti for creating the t5-flax-gcp repository.
- Also thanks to the creators of gpt2-medium-persian and gpt2-medium-indonesian for sharing their training scripts!