@@ -1,22 +1,150 @@
 ---
-language:
-- en
 - nl
 datasets:
 - yhavinga/mc4_nl_cleaned
 - yhavinga/ccmatrix
 tags:
 - translation
 license: apache-2.0
 ---
-## PreTraining
-The model was pre-trained on a English and Dutch mC4 cleaned.
-## Finetuning
-The model was finetuned on CCMatrix, validated on Tatoeba.
- * **128-max token length**
- * Only the first 25M sentences of CCMatrix were used, both en->nl and en->nl (total 50M sentences).
- * Note: multi-direction. Prepend either `translate Dutch to English: ` or `translate English to Dutch: `

 ---
+language:
 - nl
 datasets:
 - yhavinga/mc4_nl_cleaned
 - yhavinga/ccmatrix
 tags:
+- t5
 - translation
+- seq2seq
+pipeline_tag: translation
+widget:
+- text: "It is a painful and tragic spectacle that rises before me: I have drawn back the curtain from the rottenness of man. This word, in my mouth, is at least free from one suspicion: that it involves a moral accusation against humanity. It is used--and I wish to emphasize the fact again--without any moral significance: and this is so far true that the rottenness I speak of is most apparent to me precisely in those quarters where there has been most aspiration, hitherto, toward 'virtue' and 'godliness.'"
+- text: "For once Fletcher’s sedate features showed a certain lightness. 'I believe I will linger awhile longer.' He indicated a holoscreen which was displaying the image from an external camera. Cloud-splattered landscape was rolling past, pastel greens, browns, and blues illuminated by Duke’s radiance. 'It is not often a mortal man is permitted to view a world over the shoulder of angels.'"
 license: apache-2.0
 ---
+# t5-small-24L-ccmatrix-multi
+A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) sequence-to-sequence model
+pre-trained from scratch on [cleaned Dutch 🇳🇱🇧🇪 mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).
+This **t5 eff** model has **249M** parameters.
+It was pre-trained on the dataset
+`mc4_nl_cleaned` config `large_en_nl` for **1** epoch(s) and a duration of **4d10h**,
+with a sequence length of **512**, batch size **128** and **851852** total steps.
+Pre-training evaluation loss and accuracy are **1.18** and **0.74**.
+## Tokenizer
+The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
+and has 32003 tokens.
+It was trained on Dutch and English with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
+See [tokenizer.json](./raw/main/tokenizer.json) for details.
+## Dataset
+All models listed below are trained on
+[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+which is the original mC4, except
+  * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+  * Sentences with fewer than 3 words are removed
+  * Sentences with a word of more than 1000 characters are removed
+  * Documents with fewer than 5 sentences are removed
+  * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+    "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
+The Dutch and English models are trained on a 50/50% mix of Dutch mC4 and English C4.
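The cleaning rules listed above can be sketched as a single filter function. This is a minimal illustration and not the script that produced `mc4_nl_cleaned`: the function name `clean_document`, the regex-based sentence splitter, the case-insensitive matching, and the order in which the rules are applied are all assumptions, and `bad_words` stands in for the selection from the LDNOOBW lists.

```python
import re

# Phrases copied verbatim from the list above; a document containing any of them is dropped.
BOILERPLATE_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)


def clean_document(text, bad_words=frozenset()):
    """Return the cleaned document, or None if the whole document is rejected.

    Hypothetical helper: matching is done on lowercased text (an assumption).
    """
    lowered = text.lower()
    if any(phrase in lowered for phrase in BOILERPLATE_PHRASES):
        return None
    if set(re.findall(r"\w+", lowered)) & set(bad_words):
        return None

    # Naive sentence splitter (assumption): split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    kept = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 3:                      # fewer than 3 words
            continue
        if any(len(w) > 1000 for w in words):   # a word of more than 1000 characters
            continue
        kept.append(sentence)

    # Assumption: the 5-sentence minimum is checked after sentence-level filtering.
    if len(kept) < 5:
        return None
    return " ".join(kept)
```

Whether the document-level sentence count is taken before or after the sentence-level rules is not stated here, so the result for borderline documents may differ from the real dataset.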
+## Models
+Three types of models have been trained. `t5-base-dutch` is the only model with an original T5 config.
+The other model types, t5-v1.1 and t5-eff, use `gated-gelu` instead of `relu` as the activation function
+and were trained with a dropout of `0.0` unless training would diverge (`t5-v1.1-large-dutch-cased`).
+The t5-eff models differ from each other mostly in their number of layers. The table below lists
+the dimensions of these models. Note that `efficient` is a misnomer for models with few layers:
+`t5-xl-4L-dutch-english-cased`, for example, is not efficient and is one of the worst models on downstream summarization.
+|                   | t5-base-dutch   | t5-v1.1-base-dutch-uncased   | t5-v1.1-base-dutch-cased   | t5-v1.1-large-dutch-cased   | t5-v1_1-base-dutch-english-cased   | t5-v1_1-base-dutch-english-cased-1024   | t5-small-24L-dutch-english   | t5-xl-4L-dutch-english-cased   | t5-base-36L-dutch-english-cased   | t5-eff-xl-8l-dutch-english-cased   | t5-eff-large-8l-dutch-english-cased   |
+|:------------------|:----------------|:-----------------------------|:---------------------------|:----------------------------|:-----------------------------------|:----------------------------------------|:-----------------------------|:-------------------------------|:----------------------------------|:-----------------------------------|:--------------------------------------|
+| type              | t5              | t5-v1.1                      | t5-v1.1                    | t5-v1.1                     | t5-v1.1                            | t5-v1.1                                 | t5 eff                       | t5 eff                         | t5 eff                            | t5 eff                             | t5 eff                                |
+| d_model           | 768             | 768                          | 768                        | 1024                        | 768                                | 768                                     | 512                          | 2048                           | 768                               | 1024                               | 1024                                  |
+| d_ff              | 3072            | 2048                         | 2048                       | 2816                        | 2048                               | 2048                                    | 1920                         | 5120                           | 2560                              | 16384                              | 4096                                  |
+| num_heads         | 12              | 12                           | 12                         | 16                          | 12                                 | 12                                      | 8                            | 32                             | 12                                | 32                                 | 16                                    |
+| d_kv              | 64              | 64                           | 64                         | 64                          | 64                                 | 64                                      | 64                           | 64                             | 64                                | 128                                | 64                                    |
+| num_layers        | 12              | 12                           | 12                         | 24                          | 12                                 | 12                                      | 24                           | 4                              | 36                                | 8                                  | 8                                     |
+| num parameters    | 223M            | 248M                         | 248M                       | 783M                        | 248M                               | 248M                                    | 250M                         | 585M                           | 729M                              | 1241M                              | 335M                                  |
+| feed_forward_proj | relu            | gated-gelu                   | gated-gelu                 | gated-gelu                  | gated-gelu                         | gated-gelu                              | gated-gelu                   | gated-gelu                     | gated-gelu                        | gated-gelu                         | gated-gelu                            |
+| dropout           | 0.1             | 0.0                          | 0.0                        | 0.1                         | 0.0                                | 0.0                                     | 0.0                          | 0.1                            | 0.0                               | 0.0                                | 0.0                                   |
+| dataset           | mc4_nl_cleaned  | mc4_nl_cleaned full          | mc4_nl_cleaned full        | mc4_nl_cleaned              | mc4_nl_cleaned small_en_nl         | mc4_nl_cleaned large_en_nl              | mc4_nl_cleaned large_en_nl   | mc4_nl_cleaned large_en_nl     | mc4_nl_cleaned large_en_nl        | mc4_nl_cleaned large_en_nl         | mc4_nl_cleaned large_en_nl            |
+| tr. seq len       | 512             | 1024                         | 1024                       | 512                         | 512                                | 1024                                    | 512                          | 512                            | 512                               | 512                                | 512                                   |
+| batch size        | 128             | 64                           | 64                         | 64                          | 128                                | 64                                      | 128                          | 512                            | 512                               | 64                                 | 128                                   |
+| total steps       | 527500          | 1014525                      | 1210154                    | 2427498                     | 2839630                            | 1520k/3397024                           | 851852                       | 212963                         | 212963                            | 538k/1703705                       | 851850                                |
+| epochs            | 1               | 2                            | 2                          | 2                           | 10                                 | 4                                       | 1                            | 1                              | 1                                 | 1                                  | 1                                     |
+| duration          | 2d9h            | 5d5h                         | 6d6h                       | 8d13h                       | 11d18h                             | 9d1h                                    | 4d10h                        | 6d1h                           | 17d15h                            | 4d 19h                             | 3d 23h                                |
+| optimizer         | adafactor       | adafactor                    | adafactor                  | adafactor                   | adafactor                          | adafactor                               | adafactor                    | adafactor                      | adafactor                         | adafactor                          | adafactor                             |
+| lr                | 0.005           | 0.005                        | 0.005                      | 0.005                       | 0.005                              | 0.005                                   | 0.005                        | 0.005                          | 0.009                             | 0.005                              | 0.005                                 |
+| warmup            | 10000.0         | 10000.0                      | 10000.0                    | 10000.0                     | 10000.0                            | 5000.0                                  | 20000.0                      | 2500.0                         | 1000.0                            | 1500.0                             | 1500.0                                |
+| eval loss         | 1.38            | 1.20                         | 0.96                       | 1.07                        | 1.11                               | 1.13                                    | 1.18                         | 1.27                           | 1.05                              | 1.3019                             | 1.15                                  |
+| eval acc          | 0.70            | 0.73                         | 0.78                       | 0.76                        | 0.75                               | 0.74                                    | 0.74                         | 0.72                           | 0.76                              | 0.71                               | 0.74                                  |
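To show how the dimensions in the table relate to the parameter counts, here is a rough sketch that counts the weights of a T5 encoder-decoder from `d_model`, `d_ff`, `num_heads`, `d_kv` and `num_layers`. It assumes the layout of the Hugging Face T5 implementation (no bias terms, one scale vector per layer norm, a 32-bucket relative attention bias in the first layer of each stack, and untied input and output embeddings for the gated models). The embedding size of 32103 rows is an assumption that is not stated in this card; it was chosen because it reproduces the `num_parameters` values reported for the translation models. The function name `t5_param_count` is hypothetical.

```python
def t5_param_count(d_model, d_ff, num_heads, d_kv, num_layers,
                   vocab_size=32103,      # assumption, not stated in the card
                   gated=True,            # gated-gelu uses two input projections
                   tied_embeddings=False,
                   rel_buckets=32):
    """Approximate parameter count of a T5 encoder-decoder (sketch)."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner                  # q, k, v and o projections
    feed_forward = (3 if gated else 2) * d_model * d_ff
    norm = d_model                                   # one scale vector per layer norm

    encoder_layer = attention + norm + feed_forward + norm
    # A decoder layer has self-attention, cross-attention and a feed-forward block.
    decoder_layer = 2 * (attention + norm) + feed_forward + norm

    rel_bias = rel_buckets * num_heads               # stored once per stack
    encoder = num_layers * encoder_layer + rel_bias + norm
    decoder = num_layers * decoder_layer + rel_bias + norm

    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return encoder + decoder + embeddings


small_24l = t5_param_count(d_model=512, d_ff=1920, num_heads=8, d_kv=64, num_layers=24)
base_36l = t5_param_count(d_model=768, d_ff=2560, num_heads=12, d_kv=64, num_layers=36)
print(small_24l, base_36l)
```

Under these assumptions the two results are 249991680 and 728928000, which round to the 250M and 729M listed in the table.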
+## Evaluation on summarization
+The models below have been evaluated on the summarization downstream task on 50K samples from the CNN Dailymail dataset.
+All models were fine-tuned with the AdamW optimizer with a batch size of 128 and constant learning rate of 1e-3 after a
+warmup of 64 steps, with a label smoothing factor of 0.05.
+Article and summary token lengths were set to 1024 and 142.
+|                    | t5-base-dutch   | t5-v1.1-base-dutch-uncased   | t5-v1.1-base-dutch-cased   | t5-v1_1-base-dutch-english-cased   | t5-v1_1-base-dutch-english-cased-1024   | t5-small-24L-dutch-english   | t5-xl-4L-dutch-english-cased   | t5-base-36L-dutch-english-cased   | t5-eff-large-8l-dutch-english-cased   | mt5-base   |
+|:-------------------|:----------------|:-----------------------------|:---------------------------|:-----------------------------------|:----------------------------------------|:-----------------------------|:-------------------------------|:----------------------------------|:--------------------------------------|:-----------|
+| rouge1             | 33.0313         | 33.8432                      | 34.0906                    | 33.1116                            | 34.6465                                 | 34.376                       | 30.8983                        | 35.0931                           | 33.9293                               | 33.6466    |
+| rouge2             | 12.9452         | 13.7706                      | 13.6203                    | 13.275                             | 13.8525                                 | 13.8939                      | 11.6005                        | 14.3823                           | 13.6274                               | 13.1085    |
+| rougeL             | 23.7204         | 24.5642                      | 24.7304                    | 24.3561                            | 24.721                                  | 25.2496                      | 22.6536                        | 25.3213                           | 24.5595                               | 23.909     |
+| rougeLsum          | 29.842          | 30.7783                      | 31.1438                    | 30.0548                            | 31.6104                                 | 31.3838                      | 27.8467                        | 32.3526                           | 30.952                                | 30.5054    |
+| gen_len            | 90.488          | 91.832                       | 92.122                     | 89.583                             | 98.333                                  | 90.442                       | 92.342                         | 96.832                            | 95.057                                | 96.312     |
+| num parameters     | 223M            | 248M                         | 248M                       | 248M                               | 248M                                    | 250M                         | 585M                           | 729M                              | 335M                                  | 582M       |
+| samples_per_second | 3.195           | 3.039                        | 3.0                        | 3.216                              | 2.974                                   | 1.594                        | 2.47                           | 0.623                             | 3.087                                 | 1.201      |
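The fine-tuning schedule described before the table (a constant learning rate of 1e-3 after 64 warmup steps) can be written as a small function. The linear shape of the warmup is an assumption, since only its length and the final rate are given; `learning_rate` is a hypothetical name.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=64):
    """Learning rate at a zero-based optimizer step.

    Linear warmup (assumed shape) followed by a constant rate.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```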
+## Translation models
+The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset.
+The models named `*-multi` support both directions of translation. The models are trained on CCMatrix only. As this is
+a very large dataset with over 100M Dutch-English sentence pairs, the models are trained on only a fraction of it;
+the `total steps` and `duration` rows of the table below show how long each model was trained. Evaluation is performed on a held-out CCMatrix section, as well as
+on Tatoeba and Opus Books. The `_bp` rows list the *brevity penalty*. The `avg_bleu` score is the BLEU score
+averaged over all three evaluation datasets.
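As a quick check of how `avg_bleu` is formed, the sketch below takes the unweighted mean of the three BLEU scores reported for `t5-small-24L-ccmatrix-multi` in each direction. The numbers are copied from the metrics table in this card; the helper name `avg_bleu` and the dictionary layout are only illustrative.

```python
# BLEU scores for t5-small-24L-ccmatrix-multi, copied from the metrics table.
scores = {
    "en->nl": {
        "tatoeba": 46.419809813946046,
        "ccmatrix": 57.404319674892406,
        "opus_books": 12.927244801365507,
    },
    "nl->en": {
        "tatoeba": 51.67887417355214,
        "ccmatrix": 63.08633155239932,
        "opus_books": 23.418552148252047,
    },
}


def avg_bleu(per_dataset):
    """Unweighted mean of the per-dataset BLEU scores."""
    return sum(per_dataset.values()) / len(per_dataset)


for direction, per_dataset in scores.items():
    print(f"{direction}: avg_bleu = {avg_bleu(per_dataset):.2f}")
# en->nl: avg_bleu = 38.92
# nl->en: avg_bleu = 46.06
```

Because the mean is unweighted, the low Opus Books scores pull `avg_bleu` well below the CCMatrix and Tatoeba results.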
+The translation metrics are listed in the table below:
+|                        | t5-base-36L-ccmatrix-en-nl   | t5-base-36L-ccmatrix-multi   | t5-base-36L-ccmatrix-multi   | t5-small-24L-ccmatrix-multi   | t5-small-24L-ccmatrix-multi   |
+|:-----------------------|:-----------------------------|:-----------------------------|:-----------------------------|:------------------------------|:------------------------------|
+| id                     | 0                            | 14                           | 15                           | 16                            | 20                            |
+| source_lang            | en                           | en                           | nl                           | en                            | nl                            |
+| target_lang            | nl                           | nl                           | en                           | nl                            | en                            |
+| source_prefix          | translate English to Dutch:  | translate English to Dutch:  | translate Dutch to English:  | translate English to Dutch:   | translate Dutch to English:   |
+| tatoeba_bp             | 0.9897614370103832           | 0.9736173618072754           | 0.943521164106552            | 0.9760983304454847            | 0.9406676405486575            |
+| ccmatrix_bp            | 0.9590750786190209           | 0.9536276245543676           | 0.9635673583308255           | 0.9517934939463099            | 0.9585648049711814            |
+| opus_books_bp          | 0.7478011343203491           | 0.7950194726093107           | 0.9362852511299413           | 0.770498474692027             | 0.8870675076932444            |
+| tatoeba_score          | 50.63006965176505            | 46.580601850286214           | 52.82030981131822            | 46.419809813946046            | 51.67887417355214             |
+| ccmatrix_score         | 60.33227938980884            | 56.81297258845844            | 62.836646082246254           | 57.404319674892406            | 63.08633155239932             |
+| opus_books_score       | 10.405013868050663           | 13.477997378535864           | 24.93113308798125            | 12.927244801365507            | 23.418552148252047            |
+| avg_bleu               | 40.455787636541515           | 38.95719060576017            | 46.86269632718191            | 38.91712476340132             | 46.0612526247345              |
+| total steps            | 78125                        | 390625                       | 390625                       | 390625                        | 390625                        |
+| duration               | 14h                          | 101h                         | 101h                         | 74h                           | 74h                           |
+| num_parameters         | 728928000                    | 728928000                    | 728928000                    | 249991680                     | 249991680                     |
+| label_smoothing_factor | 0.09                         | 0.15                         | 0.15                         | 0.1                           | 0.1                           |
+| learning_rate          | 0.0001                       | 5e-05                        | 5e-05                        | 0.0005                        | 0.0005                        |
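A minimal usage sketch for the multi-direction model follows. The `source_prefix` strings come from the table above; the repository id `yhavinga/t5-small-24L-ccmatrix-multi`, the 128-token limit, and the beam size of 4 are assumptions for illustration, and `build_input` and `translate` are hypothetical helper names. The `transformers` import is placed inside `translate` so the prefix logic can be used without loading the model.

```python
# Task prefixes as listed in the source_prefix row of the table.
PREFIXES = {
    ("en", "nl"): "translate English to Dutch: ",
    ("nl", "en"): "translate Dutch to English: ",
}


def build_input(text, source_lang, target_lang):
    """Prepend the task prefix that selects the translation direction."""
    try:
        prefix = PREFIXES[(source_lang, target_lang)]
    except KeyError:
        raise ValueError(f"unsupported direction: {source_lang}->{target_lang}")
    return prefix + text.strip()


def translate(text, source_lang, target_lang,
              model_name="yhavinga/t5-small-24L-ccmatrix-multi",  # assumed repo id
              max_length=128,                                      # assumed limit
              num_beams=4):                                        # assumed setting
    """Translate one text; downloads the model on first use."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(build_input(text, source_lang, target_lang),
                       return_tensors="pt", truncation=True, max_length=max_length)
    output_ids = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `translate("It is a painful and tragic spectacle.", "en", "nl")` would then return the Dutch translation as a string. For repeated calls it is better to load the tokenizer and model once and reuse them.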
+## Acknowledgements
+This project would not have been possible without compute generously provided by Google through the
+[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
+instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many
+models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would
+have completed this project otherwise.
+The following repositories were helpful in setting up the TPU-VM,
+and getting an idea of what sensible hyper-parameters are for training gpt2 from scratch.
+* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)