ccasimiro committed
Commit 5d015d5 · 1 Parent(s): af53512

Update readme

Files changed (1): README.md (+43, -9)
README.md CHANGED
@@ -4,23 +4,61 @@ tags:
 - masked-lm
 ---
 
-# BERTa: Catalan RoBERTa model
+# BERTa: RoBERTa-based Catalan language model
 
 ## Model description
 
-BERTa is a RoBERTa-base model trained for Catalan language. BERTa is trained with a **reference corpus**
-You can embed local or remote images using `![](...)`
+BERTa is a transformer-based masked language model for the Catalan language.
+It is based on the RoBERTa base architecture and has been trained on a large-scale corpus collected from
+publicly available corpora and crawlers (more details in the next section).
 
-## Intended uses & limitations
+## Training data
+
+The training corpus consists of several corpora gathered from web crawling and from publicly available sources.
+
+The publicly available corpora are:
+
+1. the Catalan part of the DOGC corpus, a set of documents from the Official Gazette of the Catalan Government;
+
+2. the Catalan Open Subtitles, a collection of translated movie subtitles (Tiedemann, 2012);
+
+3. the non-shuffled version of the Catalan part of the OSCAR corpus (Ortiz Suárez et al., 2019),
+a collection of monolingual corpora filtered from Common Crawl (https://commoncrawl.org/about/);
+
+4. the CaWac corpus, a web corpus of Catalan built from the .cat top-level domain in late 2013 (Ljubešić and Toral, 2014),
+in its non-deduplicated version;
+
+5. the Catalan Wikipedia articles, downloaded on 18-08-2020.
 
+The crawled corpora are:
+
+6. the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains;
+7. the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government;
+
+8. the ACN corpus, with 220k news items from March 2015 until October 2020, crawled from the Catalan News Agency (https://www.acn.cat/).
+
+Our new Catalan text corpus, CaText, includes (1) data from datasets
+already available in Catalan and (2) data from three new crawlers
+we recently ran.
+
+## Preprocessing
+
+## Pretraining
+
+## Eval results
+
+## Intended uses & limitations
 
 ## Limitations and bias
 
+---
+
+## Using BERTa
 ## Load the model
 
 ``` python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-
+
 tokenizer = AutoTokenizer.from_pretrained("bsc/roberta-base-ca-cased")
 
 model = AutoModelForMaskedLM.from_pretrained("bsc/roberta-base-ca-cased")
@@ -81,11 +119,7 @@ Below, an example of how to use the masked language modeling task with a pipeline
 ]
 ```
 
-## Training data
 
-## Pretraining
-
-## Eval results
 
 ### BibTeX entry and citation info
 
 
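The second hunk's header refers to the README's example of using the masked language modeling task with a pipeline; the body of that example lies outside the changed lines, so it does not appear in this diff. For reference, a minimal sketch of such a call against the `bsc/roberta-base-ca-cased` checkpoint named above (the Catalan example sentence is illustrative and is not taken from the model card):

``` python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Checkpoint id taken from the README snippet shown in the first hunk.
model_id = "bsc/roberta-base-ca-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a fill-mask pipeline; RoBERTa-style tokenizers use "<mask>" as the mask token.
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Illustrative Catalan prompt: "The capital of Catalonia is <mask>."
predictions = unmasker("La capital de Catalunya és <mask>.")

# Each prediction carries the filled-in sequence, a score, and the predicted token string.
for pred in predictions:
    print(f'{pred["sequence"]}  (score={pred["score"]:.4f}, token={pred["token_str"]})')
```

The structure mirrors the README's own example; the exact sentence and the number of returned candidates in the model card may differ.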