---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# GeBERTa

<!-- Provide a quick summary of what the model is/does. -->
GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM. 
The models range in size from 122M to 750M parameters. 


## Model details

The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k-token vocabulary, 
while the xlarge model uses a 128k-token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps 
and have a maximum sequence length of 512 tokens.


## Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
| -------- | ----------- | --------- | ------ | ------- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |
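
To make the domain mix easier to see, the token counts above can be aggregated per domain. The following is a minimal sketch: the numbers are copied from the table (in billions of tokens), while the names `TOKENS_B` and `tokens_per_domain` are purely illustrative. Because the table reports rounded values, the resulting shares are approximate.

```python
# Token counts per dataset in billions, copied from the table above.
# The table values are rounded, so all derived numbers are approximate.
TOKENS_B = [
    ("Formal", 1.9), ("Formal", 6.1), ("Formal", 19.4),
    ("Informal", 1.3), ("Informal", 0.428),
    ("Legal", 1.0),
    ("Medical", 0.05), ("Medical", 0.682), ("Medical", 0.295),
    ("Medical", 1.7), ("Medical", 0.695), ("Medical", 0.414),
    ("Literature", 0.243), ("Literature", 1.6),
]


def tokens_per_domain(rows):
    """Sum token counts (in billions) for each domain (illustrative helper)."""
    totals = {}
    for domain, tokens in rows:
        totals[domain] = totals.get(domain, 0.0) + tokens
    return totals


domain_totals = tokens_per_domain(TOKENS_B)
grand_total = sum(domain_totals.values())

# Print each domain's token count and its share of the whole corpus.
for domain, tokens in domain_totals.items():
    share = 100 * tokens / grand_total
    print(f"{domain:<10} {tokens:6.2f}B  {share:5.1f}%")
print(f"{'Total':<10} {grand_total:6.2f}B")
```

Under these rounded figures, the formal sources (Wikipedia, News, GC4) make up roughly three quarters of all tokens.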


## Benchmark

In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering, 
classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets. 
When the datasets provided training, development, and test sets, we used them accordingly. 
Otherwise, we randomly split the data into 80% for training, 10% for validation, and 10% for testing. 
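
A random 80/10/10 split of this kind can be sketched as follows. This is only an illustration of the proportions: the function name `split_80_10_10`, the fixed seed, and the rounding behavior are assumptions, and the exact procedure used for the benchmark is not specified here.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle the examples and split them 80/10/10 (hypothetical helper).

    The seed value is an arbitrary choice made for reproducibility.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Example with 1,000 dummy examples standing in for a real dataset.
train_set, val_set, test_set = split_80_10_10(range(1000))
print(len(train_set), len(val_set), len(test_set))
```
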
The following table presents the F1 scores:


|         Model         |   [GE14](https://huggingface.co/datasets/germeval_14)   |  [GQuAD](https://huggingface.co/datasets/deepset/germanquad)  |   [GE18](https://huggingface.co/datasets/philschmid/germeval18)   |    TS    |   [GGP](https://github.com/JULIELab/GGPOnc)   |  GRAS<sup>1</sup>  |    [JS](https://github.com/JULIELab/jsyncc)    |  [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release)  |  Avg   |
|:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
|     [GBERT](https://huggingface.co/deepset/gbert-base)<sub>base</sub>         | 87.10±0.12 | 72.19±0.82 | 51.27±1.4 | 72.34±0.48 | 78.17±0.25 | 62.90±0.01 | 77.18±3.34 | 88.03±0.20 | 73.65±0.50 |
|   [GELECTRA](https://huggingface.co/deepset/gelectra-base)<sub>base</sub>   | 86.19±0.5 | 74.09±0.70 | 48.02±1.80 | 70.62±0.44 | 77.53±0.11 | 65.97±0.01 | 71.17±2.94 | 88.06±0.37 | 72.71±0.66 |
|  [GottBERT](https://huggingface.co/uklfr/gottbert-base)  | 87.15±0.19 | 72.76±0.378 | 51.12±1.20 | 74.25±0.80 | **78.18**±0.11 | 65.71±0.01 | 74.60±4.75 | 88.61±0.23 | 74.05±0.51 |
| GeBERTa<sub>base</sub> | **88.06**±0.22 | **78.54**±0.32 | **53.16**±1.39 | **74.83**±0.36 | 78.13±0.15 | **68.37**±1.11 | **81.85**±5.23 | **89.14**±0.32 | **76.51**±0.32 |
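
The Avg column appears to be the unweighted mean of the eight task scores. This is an inference from the numbers rather than something stated above, and the sketch below only checks that reading against the reported values. The dictionary names are illustrative, and the standard deviations are ignored.

```python
# Mean F1 per task, copied from the table above in column order:
# GE14, GQuAD, GE18, TS, GGP, GRAS, JS, DROC.
F1 = {
    "GBERT-base":    [87.10, 72.19, 51.27, 72.34, 78.17, 62.90, 77.18, 88.03],
    "GELECTRA-base": [86.19, 74.09, 48.02, 70.62, 77.53, 65.97, 71.17, 88.06],
    "GottBERT":      [87.15, 72.76, 51.12, 74.25, 78.18, 65.71, 74.60, 88.61],
    "GeBERTa-base":  [88.06, 78.54, 53.16, 74.83, 78.13, 68.37, 81.85, 89.14],
}

# Values of the Avg column as reported in the table.
REPORTED_AVG = {
    "GBERT-base": 73.65,
    "GELECTRA-base": 72.71,
    "GottBERT": 74.05,
    "GeBERTa-base": 76.51,
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Compare the recomputed mean with the reported average for each model.
for model, scores in F1.items():
    diff = abs(mean(scores) - REPORTED_AVG[model])
    print(f"{model:<14} computed {mean(scores):.3f}  reported {REPORTED_AVG[model]:.2f}  diff {diff:.3f}")
```

For all four models, the recomputed mean agrees with the reported average to within rounding.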


## Publication

```bibtex
@inproceedings{dada2023impact,
  title={On the Impact of Cross-Domain Data on German Language Models},
  author={Dada, Amin and Chen, Aokun and Peng, Cheng and Smith, Kaleb E and Idrissi-Yaghir, Ahmad and Seibold, Constantin Marc and Li, Jianning and Heiliger, Lars and Friedrich, Christoph M and Truhn, Daniel and others},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023}
}
```

Paper on arXiv: https://arxiv.org/abs/2310.07321

## Contact

<amin.dada@uk-essen.de>