amindada committed on
Commit e192b87 (1 parent: 0bd657b)

Update README.md

Files changed (1)
  1. README.md +5 -28
README.md CHANGED
@@ -66,34 +66,11 @@ The following table presents the F1 scores:
 ## Publication
 
 ```bibtex
-@inproceedings{dada-etal-2023-impact,
-    title = "On the Impact of Cross-Domain Data on {G}erman Language Models",
-    author = "Dada, Amin and
-      Chen, Aokun and
-      Peng, Cheng and
-      Smith, Kaleb and
-      Idrissi-Yaghir, Ahmad and
-      Seibold, Constantin and
-      Li, Jianning and
-      Heiliger, Lars and
-      Friedrich, Christoph and
-      Truhn, Daniel and
-      Egger, Jan and
-      Bian, Jiang and
-      Kleesiek, Jens and
-      Wu, Yonghui",
-    editor = "Bouamor, Houda and
-      Pino, Juan and
-      Bali, Kalika",
-    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
-    month = dec,
-    year = "2023",
-    address = "Singapore",
-    publisher = "Association for Computational Linguistics",
-    url = "https://aclanthology.org/2023.findings-emnlp.922",
-    doi = "10.18653/v1/2023.findings-emnlp.922",
-    pages = "13801--13813",
-    abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
+@inproceedings{dada2023impact,
+  title={On the Impact of Cross-Domain Data on German Language Models},
+  author={Dada, Amin and Chen, Aokun and Peng, Cheng and Smith, Kaleb E and Idrissi-Yaghir, Ahmad and Seibold, Constantin Marc and Li, Jianning and Heiliger, Lars and Friedrich, Christoph M and Truhn, Daniel and others},
+  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
+  year={2023}
 }
 ```
 ## Contact