Commit e975afc by mmarimon (parent: eb1b29c)

Update README.md

Files changed (1): README.md (+41 -29)

README.md CHANGED
@@ -35,20 +35,25 @@ widget:
  <details>
  <summary>Click to expand</summary>

- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-uses-and-limitations)
- - [How to Use](#how-to-use)
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-use)
+ - [How to use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
- - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
+ - [Training data](#training-data)
+ - [Training procedure](#training-procedure)
  - [Evaluation](#evaluation)
- - [CLUB Benchmark](#club-benchmark)
- - [Evaluation Results](#evaluation-results)
+ - [CLUB benchmark](#club-benchmark)
+ - [Evaluation results](#evaluation-results)
  - [Licensing Information](#licensing-information)
- - [Citation Information](#citation-information)
- - [Funding](#funding)
- - [Contributions](#contributions)
- - [Disclaimer](#disclaimer)
+ - [Additional information](#additional-information)
+ - [Author](#author)
+ - [Contact information](#contact-information)
+ - [Copyright](#copyright)
+ - [Licensing information](#licensing-information)
+ - [Funding](#funding)
+ - [Citing information](#citing-information)
+ - [Disclaimer](#disclaimer)

  </details>
@@ -58,12 +63,12 @@ The **roberta-base-ca-v2** is a transformer-based masked language model for the
  It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
  and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.

- ## Intended Uses and Limitations
+ ## Intended uses and limitations

  **roberta-base-ca-v2** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
  However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.

- ## How to Use
+ ## How to use

  Here is how to use this model:

@@ -80,6 +85,10 @@ res_hf = pipeline(text)
  pprint([r['token_str'] for r in res_hf])
  ```

+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
+
  ## Training

  ### Training data
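Note: the hunk above shows only the tail of the usage snippet introduced by "Here is how to use this model:" (the `res_hf = pipeline(text)` context line and the `pprint` call). A minimal, self-contained sketch of the fill-mask call those lines belong to, assuming the published hub id `projecte-aina/roberta-base-ca-v2` and the generic `transformers` pipeline API rather than the card's exact elided lines:

```python
from pprint import pprint
from transformers import pipeline as hf_pipeline

# Hub id assumed from the model card; the exact loading code is elided by the diff window.
pipeline = hf_pipeline("fill-mask", model="projecte-aina/roberta-base-ca-v2")

text = "Em dic <mask>."                   # Catalan: "My name is <mask>."
res_hf = pipeline(text)                   # list of candidate fillers for <mask>
pprint([r["token_str"] for r in res_hf])  # print only the predicted token strings
```

Each result dict returned by the fill-mask pipeline also carries a `score` and the completed `sequence`, which helps when inspecting the model's top predictions.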
@@ -104,7 +113,7 @@ The training corpus consists of several corpora gathered from web crawling and p
  | Vilaweb | 0.06 |
  | Tweets | 0.02 |

- ### Training Procedure
+ ### Training procedure

  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
  used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,262 tokens.
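As a quick check of the tokenizer described in this hunk (a byte-level BPE with a 50,262-token vocabulary), it can be loaded directly from the hub; the hub id is assumed from the model card:

```python
from transformers import AutoTokenizer

# Hub id assumed from the model card.
tok = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")

print(len(tok))                          # vocabulary size; the card cites 50,262 tokens
print(tok.tokenize("Em dic la Maria."))  # byte-level BPE subword pieces for a Catalan sentence
```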
@@ -115,7 +124,7 @@ The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.

  ## Evaluation

- ### CLUB Benchmark
+ ### CLUB benchmark

  The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
  that has been created along with the model.
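The card states that the raw checkpoint is only usable for fill-mask and is meant to be fine-tuned on downstream tasks such as the CLUB ones referenced in this hunk. A minimal, hypothetical sketch of attaching a classification head before fine-tuning; the label count and example sentence are illustrative, not taken from the card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "projecte-aina/roberta-base-ca-v2"  # hub id assumed from the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative 3-label classification head; the classifier weights are freshly
# initialised and would still need to be fine-tuned on a labelled dataset.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

batch = tokenizer(["Aquesta és una frase d'exemple."], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([1, 3]); the fine-tuning loop itself is not shown
```

From here a standard `transformers.Trainer` or a plain PyTorch loop over the tokenized task splits would complete the fine-tuning.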
@@ -168,7 +177,7 @@ Here are the train/dev/test splits of the datasets:
  | QA (ViquiQuAD) | 14,239 | 11,255 | 1,492 | 1,429 |
  | QA (CatalanQA) | 21,427 | 17,135 | 2,157 | 2,135 |

- ### Evaluation Results
+ ### Evaluation results

  | Task | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM)| ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
  | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
@@ -180,11 +189,24 @@ Here are the train/dev/test splits of the datasets:

  <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.

- ## Licensing Information
+ ## Additional information
+
+ ### Author
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
+
+ ### Contact information
+ For further information, send an email to aina@bsc.es

+ ### Copyright
+ Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center
+
+ ### Licensing information
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

- ## Citation Information
+ ### Funding
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
+
+ ### Citation information

  If you use any of these resources (datasets or models) in your work, please cite our latest paper:
  ```bibtex
@@ -209,17 +231,7 @@ If you use any of these resources (datasets or models) in your work, please cite
  }
  ```

- ## Funding
-
- This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
-
-
- ## Contributions
-
- [N/A]
-
-
- ## Disclaimer
+ ### Disclaimer

  <details>
  <summary>Click to expand</summary>