BSC-LT/roberta-base-ca


ccasimiro committed on May 20, 2021

Commit 5d015d5 · 1 Parent(s): af53512

Update readme

Files changed (1)

README.md +43 -9

@@ -4,23 +4,61 @@ tags:
 - masked-lm
 ---
-# BERTa: Catalan RoBERTa model
 ## Model description
-BERTa is a RoBERTa-base model trained for Catalan language. BERTa is trained with a **reference corpus**
-You can embed local or remote images using `![](...)`
-## Intended uses & limitations
 ## Limitations and bias
 ## Load the model
 ``` python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
 tokenizer = AutoTokenizer.from_pretrained("bsc/roberta-base-ca-cased")
 model = AutoModelForMaskedLM.from_pretrained("bsc/roberta-base-ca-cased")
@@ -81,11 +119,7 @@ Below, an example of how to use the masked language modeling task with a pipelin
 ]
 ```
-## Training data
-## Pretraining
-## Eval results
 ### BibTeX entry and citation info

 - masked-lm
 ---
+# BERTa: RoBERTa-based Catalan language model
 ## Model description
+BERTa is a transformer-based masked language model for the Catalan language.
+It is based on the RoBERTa architecture in its base version and has been trained on a large-scale corpus collected from
+publicly available corpora and crawlers (more details in the next section).
+## Training data
+The training corpus consists of several corpora gathered from web crawling and public corpora.
+The publicly available corpora are:
+ 1. the Catalan part of the DOGC corpus, a set of documents from the Official Gazette of the Catalan Government;
+ 2. the Catalan Open Subtitles, a collection of translated movie subtitles \cite{tiedemann2012parallel};
+ 3. the non-shuffled version of the Catalan part of the OSCAR corpus \cite{suarez2019asynchronous},
+    a collection of monolingual corpora filtered from Common Crawl (https://commoncrawl.org/about/);
+ 4. the CaWac corpus, a web corpus of Catalan built from the .cat top-level domain in late 2013 \cite{ljubesic2014cawac},
+    in its non-deduplicated version;
+ 5. the Catalan Wikipedia articles downloaded on 18-08-2020.
+The crawled corpora are:
+ 6. the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains;
+ 7. the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government;
+ 8. the ACN corpus, with 220k news items from March 2015 until October 2020, crawled from the Catalan News Agency (https://www.acn.cat/).
+Our new Catalan text corpus, CaText, includes (1)
+data from datasets already available in Catalan and
+(2) data from three new crawlers we recently ran.
+## Preprocessing
+## Pretraining
+## Eval results
+## Intended uses & limitations
 ## Limitations and bias
+---
+## Using BERTa
 ## Load the model
 ``` python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
 tokenizer = AutoTokenizer.from_pretrained("bsc/roberta-base-ca-cased")
 model = AutoModelForMaskedLM.from_pretrained("bsc/roberta-base-ca-cased")
 ]
 ```
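The loading snippet and the masked language modeling pipeline mentioned in this README could be combined roughly as follows. This is a minimal sketch, not the authors' example: the model identifier is copied from the README as committed and may differ from the current repository name, and `load_fill_mask` and `top_predictions` are hypothetical helper names introduced here for illustration.

```python
from typing import Callable, List, Tuple

# Model identifier as it appears in this README (assumption: it may differ
# from the repository name currently used on the Hub).
MODEL_ID = "bsc/roberta-base-ca-cased"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline; downloads the model on first use."""
    # Imported lazily so top_predictions can be used without transformers.
    from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return pipeline("fill-mask", model=model, tokenizer=tokenizer)


def top_predictions(
    fill_mask: Callable, text: str, k: int = 3
) -> List[Tuple[str, float]]:
    """Return up to k (token, score) pairs for a text with one mask token.

    Assumes fill_mask returns a list of dicts with "token_str" and "score"
    keys, which is the shape the fill-mask pipeline produces for a single mask.
    """
    results = fill_mask(text)
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), float(r["score"])) for r in ranked[:k]]
```

With the model downloaded, a call might look like `fill_mask = load_fill_mask()` followed by `top_predictions(fill_mask, f"Em dic {fill_mask.tokenizer.mask_token}.")`; the Catalan sentence is an arbitrary illustration. Reading the mask token from the tokenizer avoids hard-coding it.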
 ### BibTeX entry and citation info