bsc-temu committed
Commit: eb8ceb7
1 Parent(s): 78ad185

remove repeated text readme

README.md CHANGED
@@ -111,75 +111,7 @@ It contains the following tasks and their related datasets:
 
 3. Text Classification (TC)
 
-**[TeCla](
-language: "ca"
-tags:
-- masked-lm
-- BERTa
-- catalan
-license: apache-2.0
----
-
-# BERTa: RoBERTa-based Catalan language model
-
-## BibTeX citation
-
-If you use any of these resources (datasets or models) in your work, please cite our latest paper:
-
-```bibtex
-@inproceedings{armengol-estape-etal-2021-multilingual,
-title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
-author = "Armengol-Estap{\'e}, Jordi and
-Carrino, Casimiro Pio and
-Rodriguez-Penagos, Carlos and
-de Gibert Bonet, Ona and
-Armentano-Oller, Carme and
-Gonzalez-Agirre, Aitor and
-Melero, Maite and
-Villegas, Marta",
-booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
-month = aug,
-year = "2021",
-address = "Online",
-publisher = "Association for Computational Linguistics",
-url = "https://aclanthology.org/2021.findings-acl.437",
-doi = "10.18653/v1/2021.findings-acl.437",
-pages = "4933--4946",
-}
-```
-
-
-## Model description
-
-BERTa is a transformer-based masked language model for the Catalan language.
-It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
-and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
-
-## Training corpora and preprocessing
-
-The training corpus consists of several corpora gathered from web crawling and public corpora.
-
-The publicly available corpora are:
-
-1. the Catalan part of the [DOGC](http://opus.nlpl.eu/DOGC-v2.php) corpus, a set of documents from the Official Gazette of the Catalan Government
-
-2. the [Catalan Open Subtitles](http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.ca.gz), a collection of translated movie subtitles
-
-3. the non-shuffled version of the Catalan part of the [OSCAR](https://traces1.inria.fr/oscar/) corpus \\\\cite{suarez2019asynchronous},
-a collection of monolingual corpora, filtered from [Common Crawl](https://commoncrawl.org/about/)
-
-4. The [CaWac](http://nlp.ffzg.hr/resources/corpora/cawac/) corpus, a web corpus of Catalan built from the .cat top-level-domain in late 2013
-the non-deduplicated version
-
-5. the [Catalan Wikipedia articles](https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/cawiki/20200801/) downloaded on 18-08-2020.
-
-The crawled corpora are:
-
-6. The Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains
-7. the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government
-
-8. the ACN corpus with 220k news items from March 2015 until October 2020, crawled from the [Catalan News Agency](https://www.acn.cat/)
-https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
+**[TeCla](https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
 
 4. Semantic Textual Similarity (STS)
 