jarodrigues committed on
Commit 25bc6ac
Parent: c260ec7

Update README.md

Files changed (1)
  1. README.md +8 -15
README.md CHANGED
@@ -34,20 +34,19 @@ widget:
  # Albertina PT-PT Base
 
 
- **Albertina PT-*** is a foundation, large language model for the **Portuguese language**.
+ **Albertina PT-PT Base** is a foundation, large language model for European **Portuguese** from **Portugal**.
 
  It is an **encoder** of the BERT family, based on the neural architecture Transformer and
  developed over the DeBERTa model, with most competitive performance for this language.
- It has different versions that were trained for different variants of Portuguese (PT),
- namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
- and it is distributed free of charge and under a most permissible license.
+ It is distributed free of charge and under a most permissible license.
 
- **Albertina PT-PT** is the version for European **Portuguese** from **Portugal**,
+ You may be also interested in [**Albertina PT-PT**](https://huggingface.co/PORTULAN/albertina-ptpt).
+ This is a larger version,
  and to the best of our knowledge, at the time of its initial distribution,
  it is the first competitive encoder specifically for this language and variant
  that is made publicly available and distributed for reuse.
 
- It is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
+ **Albertina PT-PT Base** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
  For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):
 
  ``` latex
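
The model described in the hunk above is a standard Hugging Face encoder checkpoint. As a minimal, hedged sketch of how such an encoder is typically queried for masked-token prediction with the `transformers` library: the repository id `PORTULAN/albertina-ptpt-base` and the `[MASK]` placeholder are assumptions inferred from the model names linked in this README, not details stated in the diff.

```python
# Minimal sketch, not from the commit: masked-token prediction with an
# Albertina encoder. The repository id "PORTULAN/albertina-ptpt-base" and the
# [MASK] placeholder are assumptions, not confirmed by the diff above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PORTULAN/albertina-ptpt-base")
for prediction in fill_mask("A capital de Portugal é [MASK]."):
    print(prediction["token_str"], prediction["score"])
```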
@@ -90,12 +89,9 @@ DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBER
  - [ParlamentoPT](https://huggingface.co/datasets/PORTULAN/parlamento-pt): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.
 
 
- [**Albertina PT-BR Base**](https://huggingface.co/PORTULAN/albertina-ptbr-base), in turn, was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set, specifically filtered by the Internet country code top-level domain of Brazil.
-
-
  ## Preprocessing
 
- We filtered the PT-PT and PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
+ We filtered the PT-PT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
  We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese.
 
 
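
The preprocessing text kept in the hunk above describes retaining the BLOOM document-level filters while leaving out stopword and language-identification filtering. Below is a plain-Python sketch of that decision, purely illustrative: it does not use the actual [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) API, and the length check stands in for whichever filters were actually kept.

```python
# Illustrative sketch only, not the BLOOM data-preparation API: stopword and
# language-identification filters are deliberately left out of the enabled set,
# as the README describes; the length check is a placeholder for kept filters.
from typing import Callable, List

def min_length_filter(doc: str) -> bool:
    return len(doc.split()) >= 5  # placeholder document-quality filter

# Not enabled: stopword filtering (would disrupt syntactic structure) and
# language identification (the corpus is already pre-selected as Portuguese).
ENABLED_FILTERS: List[Callable[[str], bool]] = [min_length_filter]

corpus = ["Os deputados debateram hoje o orçamento do Estado.", "ok"]
kept = [doc for doc in corpus if all(f(doc) for f in ENABLED_FILTERS)]
print(kept)  # the short document is dropped, the parliamentary sentence kept
```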
 
@@ -109,10 +105,6 @@ We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
  A total of 200 training epochs were performed resulting in approximately 180k steps.
  The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM.
 
- To train [**Albertina PT-BR Base**](https://huggingface.co/PORTULAN/albertina-ptbr-base) we followed the same hyperparameterization as the Albertina-PT-PT Base model.
- The model was trained with a total of 150 training epochs resulting in approximately 180k steps.
- The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM.
-
 
  <br>
 
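
The hyperparameters retained above (learning rate 1e-5, linear decay, 10k warm-up steps, 200 epochs) would look roughly as follows when expressed as `transformers.TrainingArguments`; the batch size, output directory, and the use of `TrainingArguments` itself are assumptions for illustration, not the authors' actual training script.

```python
# Hedged sketch: only the values marked "from the README" come from the diff;
# everything else is an assumption for illustration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="albertina-ptpt-base",   # hypothetical output path
    learning_rate=1e-5,                 # from the README
    lr_scheduler_type="linear",         # linear decay, from the README
    warmup_steps=10_000,                # 10k warm-up steps, from the README
    num_train_epochs=200,               # from the README
    per_device_train_batch_size=8,      # assumption, not stated in the diff
)
```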
 
@@ -129,7 +121,8 @@ We automatically translated the same four tasks from GLUE using [DeepL Translate
 
  | Model | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) |
  |--------------------------|----------------|----------------|-----------|-----------------|
- | **Albertina-PT-PT Base** | 0.6787 | 0.4507 | 0.8829 | 0.8581 |
+ | **Albertina-PT-PT** | **0.8339** | 0.4225 | **0.9171**| **0.8801** |
+ | **Albertina-PT-PT Base** | 0.6787 | **0.4507** | 0.8829 | 0.8581 |
 
  <br>
 
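
The columns in the table above use the standard GLUE metrics: accuracy for RTE and WNLI, F1 for MRPC, and Pearson correlation for STS-B. A minimal sketch of computing them on placeholder predictions (the numbers below are toy values, not the reported results):

```python
# Minimal sketch of the metrics in the table above (accuracy for RTE/WNLI,
# F1 for MRPC, Pearson correlation for STS-B), on placeholder predictions.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))   # RTE, WNLI
print("F1:", f1_score(y_true, y_pred))               # MRPC

gold, predicted = [4.5, 1.0, 3.2, 2.8], [4.1, 0.8, 3.5, 2.2]
pearson_r, _ = pearsonr(gold, predicted)
print("Pearson:", pearson_r)                         # STS-B
```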
 
 