Update README.md
README.md CHANGED
````diff
@@ -66,34 +66,11 @@ The following table presents the F1 scores:
 ## Publication
 
 ```bibtex
-@inproceedings{
-    title = "On the Impact of Cross-Domain Data on German Language Models",
-    author = "Dada, Amin and
-      Chen, Aokun and
-      Peng, Cheng and
-      Smith, Kaleb and
-      Idrissi-Yaghir, Ahmad and
-      Seibold, Constantin and
-      Li, Jianning and
-      Heiliger, Lars and
-      Friedrich, Christoph and
-      Truhn, Daniel and
-      Egger, Jan and
-      Bian, Jiang and
-      Kleesiek, Jens and
-      Wu, Yonghui",
-    editor = "Bouamor, Houda and
-      Pino, Juan and
-      Bali, Kalika",
-    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
-    month = dec,
-    year = "2023",
-    address = "Singapore",
-    publisher = "Association for Computational Linguistics",
-    url = "https://aclanthology.org/2023.findings-emnlp.922",
-    doi = "10.18653/v1/2023.findings-emnlp.922",
-    pages = "13801--13813",
-    abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
+@inproceedings{dada2023impact,
+  title={On the Impact of Cross-Domain Data on German Language Models},
+  author={Dada, Amin and Chen, Aokun and Peng, Cheng and Smith, Kaleb E and Idrissi-Yaghir, Ahmad and Seibold, Constantin Marc and Li, Jianning and Heiliger, Lars and Friedrich, Christoph M and Truhn, Daniel and others},
+  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
+  year={2023}
 }
 ```
 ## Contact
````