Update README.md
README.md
CHANGED
@@ -32,7 +32,7 @@ The pre-training dataset consists of documents from different domains:
 | Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
 | Medical | Smaller public datasets | 253MB | 179,776 | 50M |
 | Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
-| Medical |
+| Medical | Medical Dissertations | 1.4GB | 14,496 | 295M |
 | Medical | Pubmed abstracts | 8.5GB | 21,044,382 | 1.7B |
 | Medical | MIMIC III | 2.6GB | 24,221,834 | 695M |
 | Medical | PMC-Patients-ReCDS | 2.1GB | 1,743,344 | 414M |
@@ -44,7 +44,7 @@ The pre-training dataset consists of documents from different domains:
 ## Benchmark

 In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering,
-classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection
+classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets.
 When the datasets provided training, development, and test sets, we used them accordingly.


@@ -61,7 +61,7 @@ The following table presents the F1 scores:
 | GottBERT | 87.15±0.19 | 72.76±0.378 | 51.12±1.20 | 74.25±0.80 | **78.18**±0.11 | 65.71±0.01 | 74.60±4.75 | 88.61±0.23 | 74.05±0.51 |
 | GeBERTa-base | **88.06**±0.22 | **78.54**±0.32 | **53.16**±1.39 | **74.83**±0.36 | 78.13±0.15 | **68.37**±1.11 | **81.85**±5.23 | **89.14**±0.32 | **76.51**±0.32 |

-<sup>1</sup>Is not published yet but described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).
+<sup>1</sup>Is not published yet but is described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).

 ## Publication
