🇩🇪 GERTuraX-2

This repository hosts the GERTuraX-2 model:

  • GERTuraX-2 is a German encoder-only model, based on ELECTRA and pretrained with the TEAMS approach.
  • It was trained on 486GB of plain text from the CulturaX corpus.
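
For quick experimentation, the model can be loaded with the Hugging Face Transformers library. The following is a minimal sketch (assuming the gerturax/gerturax-2 checkpoint on the Hub, as also used in the ScandEval commands below). Note that GERTuraX-2 is an ELECTRA-style encoder, so it produces contextual embeddings for fine-tuning rather than masked-token predictions:

from transformers import AutoModel, AutoTokenizer

# Minimal usage sketch: load GERTuraX-2 as a plain encoder.
tokenizer = AutoTokenizer.from_pretrained("gerturax/gerturax-2")
model = AutoModel.from_pretrained("gerturax/gerturax-2")

# Encode a German sentence and inspect the contextual embeddings.
inputs = tokenizer("Die Hauptstadt von Bayern ist München.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)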

Pretraining

The TensorFlow Model Garden LMs repo was used to train an ELECTRA model using the very efficient TEAMS approach.

As pretraining corpus, 486GB of plain text was extracted from the CulturaX corpus.

GERTuraX-2 uses a 64k cased vocabulary and was trained for 1M steps with a batch size of 1024 and a sequence length of 512 on a v3-32 TPU Pod.
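
The cased 64k vocabulary can be sanity-checked by inspecting the tokenizer (a small sketch, again assuming the Hub checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gerturax/gerturax-2")
print(len(tokenizer))  # expected: roughly 64k subwords
print(tokenizer.tokenize("Oberlandesgericht"))  # cased subword segmentation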

The pretraining took 5.4 days, and the TensorBoard logs can be found here.

Evaluation

GERTuraX-2 was evaluated on GermEval 2014 (NER), GermEval 2018 (offensive language identification), CoNLL-2003 (German NER), and on the ScandEval benchmark.

For GermEval 2014, GermEval 2018, and CoNLL-2003, we use the same hyper-parameters as the GeBERTa paper (cf. Table 5), performing 5 runs with different seeds and reporting the averaged score. All fine-tuning experiments were conducted with the awesome Flair library.

The fine-tuning code repository can be found here.
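
As an illustration, a Flair fine-tuning run on GermEval 2014 could look like the following sketch. The hyper-parameter values shown here are placeholders only; for the exact values, refer to Table 5 of the GeBERTa paper:

from flair.datasets import NER_GERMAN_GERMEVAL
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load the GermEval 2014 NER corpus and build the label dictionary.
corpus = NER_GERMAN_GERMEVAL()
label_dict = corpus.make_label_dictionary(label_type="ner")

# Use GERTuraX-2 as a fine-tunable transformer embedding.
embeddings = TransformerWordEmbeddings(
    model="gerturax/gerturax-2",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
)

# Plain linear tagger on top of the encoder (no CRF, no RNN).
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/germeval-2014",
    learning_rate=5e-5,   # placeholder; see GeBERTa paper, Table 5
    mini_batch_size=16,   # placeholder
    max_epochs=10,        # placeholder
)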

GermEval 2014

GermEval 2014 - Original version

Model Name          Avg. Development F1-Score   Avg. Test F1-Score
GBERT Base          87.53 ± 0.22                86.81 ± 0.16
GERTuraX-1 (147GB)  88.32 ± 0.21                87.18 ± 0.12
GERTuraX-2 (486GB)  88.58 ± 0.32                87.58 ± 0.15
GERTuraX-3 (1.1TB)  88.90 ± 0.06                87.84 ± 0.18
GeBERTa Base        88.79 ± 0.16                88.03 ± 0.16

GermEval 2014 - Without Wikipedia

Model Name          Avg. Development F1-Score   Avg. Test F1-Score
GBERT Base          90.48 ± 0.34                89.05 ± 0.21
GERTuraX-1 (147GB)  91.27 ± 0.11                89.73 ± 0.27
GERTuraX-2 (486GB)  91.70 ± 0.28                89.98 ± 0.22
GERTuraX-3 (1.1TB)  91.75 ± 0.17                90.24 ± 0.27
GeBERTa Base        91.74 ± 0.23                90.28 ± 0.21

GermEval 2018

GermEval 2018 - Fine Grained

Model Name          Avg. Development F1-Score   Avg. Test F1-Score
GBERT Base          63.66 ± 4.08                51.86 ± 1.31
GERTuraX-1 (147GB)  62.87 ± 1.95                50.61 ± 0.36
GERTuraX-2 (486GB)  64.37 ± 1.31                51.02 ± 0.90
GERTuraX-3 (1.1TB)  66.39 ± 0.85                49.94 ± 2.06
GeBERTa Base        65.81 ± 3.29                52.45 ± 0.57

GermEval 2018 - Coarse Grained

Model Name          Avg. Development F1-Score   Avg. Test F1-Score
GBERT Base          83.15 ± 1.83                76.39 ± 0.64
GERTuraX-1 (147GB)  83.72 ± 0.68                77.11 ± 0.59
GERTuraX-2 (486GB)  84.51 ± 0.88                78.07 ± 0.91
GERTuraX-3 (1.1TB)  84.33 ± 1.48                78.44 ± 0.74
GeBERTa Base        83.54 ± 1.27                78.36 ± 0.79

CoNLL-2003 - German, Revised

Model Name          Avg. Development F1-Score   Avg. Test F1-Score
GBERT Base          92.15 ± 0.10                88.73 ± 0.21
GERTuraX-1 (147GB)  92.32 ± 0.14                90.09 ± 0.12
GERTuraX-2 (486GB)  92.75 ± 0.20                90.15 ± 0.14
GERTuraX-3 (1.1TB)  92.77 ± 0.28                90.83 ± 0.16
GeBERTa Base        92.87 ± 0.21                90.94 ± 0.24

ScandEval

We use v12.10.5 of ScandEval to evaluate on the following datasets:

  • SB10k
  • ScaLA-De
  • GermanQuAD

The package can be installed via:

$ pip3 install "scandeval[all]==12.10.5"
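
Alternatively, ScandEval exposes a Python API via its Benchmarker class. The following is a hedged sketch only; the argument names mirror the v12 CLI flags and may differ between versions:

from scandeval import Benchmarker

# Hedged sketch: benchmark GERTuraX-2 on the German sentiment task.
benchmark = Benchmarker(task="sentiment-classification", language="de")
benchmark(model="gerturax/gerturax-2")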

Results

SB10k

Evaluations on the SB10k dataset can be started as follows:

$ scandeval --model "deepset/gbert-base" --task sentiment-classification --language de
$ scandeval --model "ikim-uk-essen/geberta-base" --task sentiment-classification --language de
$ scandeval --model "gerturax/gerturax-1" --task sentiment-classification --language de
$ scandeval --model "gerturax/gerturax-2" --task sentiment-classification --language de
$ scandeval --model "gerturax/gerturax-3" --task sentiment-classification --language de

Model Name          Matthews CC    Macro F1-Score
GBERT Base          59.58 ± 1.80   72.98 ± 1.20
GERTuraX-1 (147GB)  61.56 ± 2.58   74.18 ± 1.77
GERTuraX-2 (486GB)  65.24 ± 1.77   76.55 ± 1.22
GERTuraX-3 (1.1TB)  64.33 ± 2.17   75.99 ± 1.40
GeBERTa Base        59.52 ± 2.14   72.76 ± 1.50

ScaLA-De

Evaluations on the ScaLA-De dataset can be started as follows:

$ scandeval --model "deepset/gbert-base" --task linguistic-acceptability --language de
$ scandeval --model "ikim-uk-essen/geberta-base" --task linguistic-acceptability --language de
$ scandeval --model "gerturax/gerturax-1" --task linguistic-acceptability --language de
$ scandeval --model "gerturax/gerturax-2" --task linguistic-acceptability --language de
$ scandeval --model "gerturax/gerturax-3" --task linguistic-acceptability --language de

Model Name          Matthews CC     Macro F1-Score
GBERT Base          52.23 ± 4.34    73.90 ± 2.68
GERTuraX-1 (147GB)  74.55 ± 1.28    86.88 ± 0.75
GERTuraX-2 (486GB)  75.83 ± 2.85    87.59 ± 1.57
GERTuraX-3 (1.1TB)  78.24 ± 1.25    88.83 ± 0.63
GeBERTa Base        59.70 ± 11.64   78.44 ± 6.12

GermanQuAD

Evaluations on the GermanQuAD dataset can be started as follows:

$ scandeval --model "deepset/gbert-base" --task question-answering --language de
$ scandeval --model "ikim-uk-essen/geberta-base" --task question-answering --language de
$ scandeval --model "gerturax/gerturax-1" --task question-answering --language de
$ scandeval --model "gerturax/gerturax-2" --task question-answering --language de
$ scandeval --model "gerturax/gerturax-3" --task question-answering --language de

Model Name          Exact Match    F1-Score
GBERT Base          12.62 ± 2.20   29.62 ± 3.86
GERTuraX-1 (147GB)  27.24 ± 1.05   52.01 ± 1.10
GERTuraX-2 (486GB)  29.54 ± 1.05   55.12 ± 0.92
GERTuraX-3 (1.1TB)  28.49 ± 1.21   54.83 ± 1.26
GeBERTa Base        28.81 ± 1.77   53.27 ± 1.92

❤️ Acknowledgements

GERTuraX is the outcome of the last 12 months of working with TPUs from the awesome TRC program and the TensorFlow Model Garden library.

Many thanks for providing TPUs!

Made from Bavarian Oberland with ❤️ and 🥨.
