Update README.md
README.md (CHANGED)
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-use)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
  - [Training data](#training-data)
  - [Training procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [CLUB benchmark](#club-benchmark)
  - [Evaluation results](#evaluation-results)
- [Licensing Information](#licensing-information)
- [Additional information](#additional-information)
  - [Author](#author)
  - [Contact information](#contact-information)
  - [Copyright](#copyright)
  - [Licensing information](#licensing-information)
  - [Funding](#funding)
  - [Citing information](#citing-information)
  - [Disclaimer](#disclaimer)

</details>

## Model description

The **roberta-base-ca-v2** is a transformer-based masked language model for the Catalan language.
It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.

## Intended uses and limitations

The **roberta-base-ca-v2** model is ready to use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.

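As an illustration of that fine-tuning path, a minimal sketch for starting a token-classification (NER) fine-tune is shown below; the hub model id and the label count are assumptions for the example, not something prescribed by this card:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hub id assumed for this example; a local path to the model would work as well.
model_id = "projecte-aina/roberta-base-ca-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Adds a randomly initialised token-classification head on top of the pretrained encoder.
# The label count (9 here, e.g. a BIO tagging scheme) depends on the downstream dataset.
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=9)

# From here, the model would be trained with the Trainer API or a custom training loop
# on a labelled NER dataset before being used for predictions.
```
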
## How to use

Here is how to use this model:

```python
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
```

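A complete, self-contained sketch of this Fill Mask usage is shown below; the hub model id and the Catalan example sentence are illustrative assumptions rather than part of the original snippet:

```python
from pprint import pprint

from transformers import pipeline as hf_pipeline

# Build a fill-mask pipeline; the hub id is assumed for this example.
pipeline = hf_pipeline("fill-mask", model="projecte-aina/roberta-base-ca-v2")

# Illustrative Catalan sentence containing a single masked token.
text = "Em dic <mask>."

# Each prediction carries the token that fills the mask and its score.
res_hf = pipeline(text)
pprint([r["token_str"] for r in res_hf])
```
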
## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data

| Vilaweb | 0.06 |
| Tweets | 0.02 |

### Training procedure

The training corpus has been tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2),
as used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 50,262 tokens.

The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.

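To illustrate the tokenizer described above, the following sketch loads it from the hub and inspects the vocabulary size and a sample tokenization; the model id and the sample sentence are assumptions for this example:

```python
from transformers import AutoTokenizer

# Hub id assumed for this example.
tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")

# The vocabulary size should match the figure quoted above (roughly 50,262 tokens).
print(tokenizer.vocab_size)

# Byte-level BPE splits rare or unseen words into subword pieces.
print(tokenizer.tokenize("El català és una llengua romànica."))
```
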
## Evaluation

### CLUB benchmark

The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
which was created along with the model.

| QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |
| QA (CatalanQA) | 21,427 | 17,135 | 2,157 | 2,135 |

### Evaluation results

| Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
| ------------ |:--------:|:--------:|:-------------:|:------------:|:-----------:|:----------------:|:-----------------:|:-----------------:|:-----------------------------:|

<sup>1</sup>: Trained on CatalanQA, tested on XQuAD-ca.

## Additional information

### Author
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

### Contact information
For further information, send an email to aina@bsc.es

### Copyright
Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center

### Licensing information
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Citation information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
}
```

### Disclaimer

<details>
<summary>Click to expand</summary>