PORTULAN/albertina-900m-portuguese-ptbr-encoder-brwac

@@ -84,10 +84,10 @@ a license for non-commercial use.
 # Training Data
-**Albertina PT-BR** was trained over the [BrWac](https://huggingface.co/datasets/brwac) data set.
-[**Albertina PT-PT**](https://huggingface.co/PORTULAN/albertina-ptpt), in turn, was trained over a data set that resulted from gathering some openly available corpora of European Portuguese from the following sources:
+**Albertina PT-BR** was trained over the 2.7 billion token [BrWac](https://huggingface.co/datasets/brwac) data set.
+[**Albertina PT-PT**](https://huggingface.co/PORTULAN/albertina-ptpt), in turn, was trained over a 2.2 billion token data set that resulted from gathering some openly available corpora of European Portuguese from the following sources:
 - [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301): the OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature. It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters. Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose meta-data indicate the Internet country code top-level domain of Portugal. We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
 - [DCEP](https://joint-research-centre.ec.europa.eu/language-technology-resources/dcep-digital-corpus-european-parliament_en): the Digital Corpus of the European Parliament is a multilingual corpus including documents in all official EU languages published on the European Parliament's official website. We retained its European Portuguese portion.
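
The extra OSCAR filtering described in the list above (keeping only documents served from Portugal's country code top-level domain) could be sketched roughly as follows. This is an illustrative sketch, not the authors' actual pipeline: the `url` field, the function names, and the sample documents are all hypothetical, and the real OSCAR records store the source address under their own metadata layout.

```python
from urllib.parse import urlsplit


def is_under_cc_tld(url: str, cc_tld: str = "pt") -> bool:
    """Return True if the URL's host ends in the given country-code TLD."""
    # hostname is None for empty or malformed URLs; treat that as "no match".
    host = urlsplit(url).hostname or ""
    # Drop a trailing root dot ("exemplo.pt.") before comparing suffixes.
    return host.lower().rstrip(".").endswith("." + cc_tld)


def filter_by_cc_tld(docs, cc_tld: str = "pt"):
    """Keep documents whose (hypothetical) "url" field is under the ccTLD."""
    return [d for d in docs if is_under_cc_tld(d.get("url", ""), cc_tld)]


# Made-up sample records; only the first one is hosted under ".pt".
docs = [
    {"url": "https://www.exemplo.pt/noticia", "text": "..."},
    {"url": "https://www.exemplo.com.br/pagina", "text": "..."},
    {"url": "https://exemplo.org/pt/", "text": "..."},
]
kept = filter_by_cc_tld(docs)
```

The check looks only at the host name, so a `/pt/` path segment on another domain does not count as a match. It is a proxy for the language variant rather than a test of the text itself.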