yhavinga committed
Commit 8254be1 · 1 Parent(s): df643e2

Update README.md

Files changed (1): README.md (+57 −21)
---
# GPT2-Large pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 large model (762M parameters) trained from scratch on Dutch, reaching a perplexity of 15.1 on cleaned Dutch mC4.

## How To Use

You can use this GPT2 model directly with a pipeline for text generation.
 
```python
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

MODEL_DIR = 'yhavinga/gpt2-large-dutch'
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
generator = pipeline('text-generation', model, tokenizer=tokenizer)

generated_text = generator('Het eiland West-', max_length=100, do_sample=True, top_k=40, top_p=0.95, repetition_penalty=2.0)
```
 
*"Het eiland West-" - "Terschelling wordt sinds jaar en dag bewoond door de mens. De mensen die in het huidige Terherne wonen doen er alles aan om hun dorp te behouden voor deze diersoort, namelijk; een natuurreservaat dat vooral bestaat uit hoge duinen met lage begroeing waar planten van vroeger worden afgewisseld (zoals wilde hyacinten)en waarop grassen groeien waarvan sommige soorten zeldzame vormen hebben ontwikkeld: duinlelie of blauwe bosbes zijn bijvoorbeeld bekend vanwege onder andere kleurmole"*
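
In the generation call above, `do_sample=True` draws the next token from the model's distribution, with `top_k` and `top_p` first restricting that distribution to the most likely tokens. A minimal pure-Python sketch of the filtering step (a hypothetical `filter_logits` helper, not the actual `transformers` implementation):

```python
import math

def filter_logits(logits, top_k=40, top_p=0.95):
    """Hypothetical helper: keep the top_k highest logits, then the smallest
    prefix of those whose cumulative probability reaches top_p.
    Returns the kept (token index, probability) pairs."""
    # Sort token indices by logit, highest first, and truncate to top_k
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the surviving logits
    exps = [math.exp(logits[i]) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top_p) cut: stop once cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept
```

Sampling then proceeds only over the returned pairs, which is why higher `top_p` or `top_k` values give more diverse (and riskier) continuations.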
 
## Tokenizer

* BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).

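
BPE training builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. A toy illustration of one merge step (hypothetical helpers, not the actual training script, which uses the Huggingface `tokenizers` library):

```python
from collections import Counter

def most_frequent_pair(words):
    """words: list of symbol sequences; return the most common adjacent pair."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged
```

Repeating these two steps until the target vocabulary size is reached yields the merge table that the trained tokenizer applies at encoding time.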
## Dataset

This model was trained on the `full` configuration (33B tokens) of
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except

* Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences with a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.

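
As a rough sketch, the document-level rules above could look like the following (a hypothetical `keep_document` helper with naive `'.'`-based sentence splitting; the badword-list filtering is omitted):

```python
# Literal boilerplate phrases from the cleaning rules above
BAD_PHRASES = ["javascript", "lorum ipsum", "terms of use", "privacy policy",
               "cookie policy", "uses cookies", "use of cookies", "use cookies",
               "elementen ontbreken", "deze printversie"]

def keep_document(doc: str) -> bool:
    """Sketch of the per-document cleaning rules (badword filtering omitted)."""
    lower = doc.lower()
    # Drop documents containing any boilerplate phrase
    if any(phrase in lower for phrase in BAD_PHRASES):
        return False
    # Naive sentence split; the real pipeline is more careful
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    # Drop sentences with fewer than 3 words or a word over 1000 characters
    sentences = [s for s in sentences
                 if len(s.split()) >= 3 and max(len(w) for w in s.split()) <= 1000]
    # Drop documents with fewer than 5 remaining sentences
    return len(sentences) >= 5
```

Filtering at both the sentence and the document level is what shrinks the original mC4 split down to the 33B-token `full` configuration.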
## Models

TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.

* `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of C4 containing only wikipedia and news sites.
* The models with `a`/`b` in the step column have been trained to step `a` of a total of `b` steps.

| | model | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
|-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
| [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M | 512 | 19.9 | 2.99 | 128 | 8 | 558608 | adamw | 2.4e-3 | 1d 12h | news+wiki |
| [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 4 | 320000/520502 | adafactor | 8e-4 | 7d 2h | full |
| [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch) | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
| [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |

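
The `ppl` and `loss` columns are two views of the same quantity: perplexity is the exponential of the per-token cross-entropy loss. Since both columns are rounded independently, `exp(loss)` only approximately reproduces the listed perplexities:

```python
import math

# ppl = exp(loss); loss/ppl values copied from the table above (both rounded)
for name, loss, ppl in [('gpt-neo-125M-dutch', 2.99, 19.9),
                        ('gpt2-medium-dutch', 2.71, 15.1),
                        ('gpt2-large-dutch', 2.72, 15.1),
                        ('gpt-neo-1.3B-dutch', 2.77, 16.0)]:
    print(f'{name}: exp({loss}) = {math.exp(loss):.1f} (table: {ppl})')
```

This is why the medium and large models can share a ppl of 15.1 while their losses differ in the third significant digit.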

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
and training the models:

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)