jarodrigues committed
Commit 25bc6ac • 1 Parent(s): c260ec7
Update README.md

README.md CHANGED
@@ -34,20 +34,19 @@ widget:
 # Albertina PT-PT Base
 
 
-**Albertina PT
+**Albertina PT-PT Base** is a foundation, large language model for European **Portuguese** from **Portugal**.
 
 It is an **encoder** of the BERT family, based on the neural architecture Transformer and
 developed over the DeBERTa model, with most competitive performance for this language.
-It
-namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
-and it is distributed free of charge and under a most permissible license.
+It is distributed free of charge and under a most permissive license.
 
-
+You may also be interested in [**Albertina PT-PT**](https://huggingface.co/PORTULAN/albertina-ptpt).
+This is a larger version,
 and to the best of our knowledge, at the time of its initial distribution,
 it is the first competitive encoder specifically for this language and variant
 that is made publicly available and distributed for reuse.
 
-
+**Albertina PT-PT Base** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
 For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):
 
 ``` latex
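The updated card introduces the model as a Transformer **encoder** of the BERT family built over DeBERTa. As an illustration only (the repository identifier `PORTULAN/albertina-ptpt-base` and the example sentence are assumptions inferred from the links above, not stated in this diff), such an encoder could be loaded with the standard `transformers` API to obtain contextual embeddings:

```python
# Illustrative sketch; the model id "PORTULAN/albertina-ptpt-base" is inferred
# from the card's links and may differ from the actual repository name.
from transformers import AutoModel, AutoTokenizer

model_id = "PORTULAN/albertina-ptpt-base"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a Portuguese sentence and inspect the contextual embeddings.
inputs = tokenizer("A Albertina é um modelo de língua para o português europeu.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```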
@@ -90,12 +89,9 @@ DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBER
 - [ParlamentoPT](https://huggingface.co/datasets/PORTULAN/parlamento-pt): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.
 
 
-[**Albertina PT-BR Base**](https://huggingface.co/PORTULAN/albertina-ptbr-base), in turn, was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set, specifically filtered by the Internet country code top-level domain of Brazil.
-
-
 ## Preprocessing
 
-We filtered the PT-PT
+We filtered the PT-PT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
 We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese.
 
 
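The preprocessing change above says the PT-PT corpora were filtered with the BLOOM pre-processing pipeline while skipping stopword filtering and language identification. The sketch below is not that pipeline; it is a toy, self-contained stand-in with invented thresholds, included only to make concrete which filters are applied and which are deliberately skipped:

```python
# Toy stand-in for the described filtering decisions; the length and word-length
# thresholds are invented, and the real work is done by the BLOOM data-preparation
# pipeline linked in the diff.
def keep_document(doc: str) -> bool:
    words = doc.split()
    if len(words) < 10:                  # hypothetical minimum-length quality filter
        return False
    if any(len(w) > 50 for w in words):  # hypothetical garbage/URL-blob filter
        return False
    # Deliberately no stopword filtering: it would disrupt syntactic structure.
    # Deliberately no language identification: the corpus is pre-selected as Portuguese.
    return True

docs = [
    "Os deputados debateram ontem no parlamento a proposta de lei sobre o ensino.",
    "fragmento curto",
]
print([keep_document(d) for d in docs])  # [True, False]
```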
@@ -109,10 +105,6 @@ We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 A total of 200 training epochs were performed resulting in approximately 180k steps.
 The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM.
 
-To train [**Albertina PT-BR Base**](https://huggingface.co/PORTULAN/albertina-ptbr-base) we followed the same hyperparameterization as the Albertina-PT-PT Base model.
-The model was trained with a total of 150 training epochs resulting in approximately 180k steps.
-The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM.
-
 
 <br>
 
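The training section above fixes the key hyperparameters: a 1e-5 learning rate with linear decay, 10k warm-up steps, and 200 epochs (≈180k steps). As a sketch only, these values could be expressed with `transformers.TrainingArguments`; the batch size and output path below are placeholders, not values from the card:

```python
# Hyperparameters taken from the card; output_dir and batch size are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="albertina-ptpt-base-pretraining",  # placeholder path
    learning_rate=1e-5,             # stated learning rate
    lr_scheduler_type="linear",     # linear decay
    warmup_steps=10_000,            # 10k warm-up steps
    num_train_epochs=200,           # ~180k steps overall per the card
    per_device_train_batch_size=8,  # placeholder, not specified in this section
)
print(training_args.num_train_epochs, training_args.warmup_steps)
```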
@@ -129,7 +121,8 @@ We automatically translated the same four tasks from GLUE using [DeepL Translate
 
 | Model                    | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) |
 |--------------------------|----------------|-----------------|-----------|-----------------|
-| **Albertina-PT-PT
+| **Albertina-PT-PT**      | **0.8339**     | 0.4225          | **0.9171**| **0.8801**      |
+| **Albertina-PT-PT Base** | 0.6787         | **0.4507**      | 0.8829    | 0.8581          |
 
 <br>
 
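The added rows report accuracy (RTE, WNLI), F1 (MRPC) and Pearson correlation (STS-B) on the machine-translated GLUE tasks. The snippet below only illustrates how those three metric types are computed, on toy labels; it does not reproduce the reported numbers or the translated data sets:

```python
# Toy illustration of the metric types in the table above (not the actual evaluation).
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import pearsonr

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]                # toy classification labels
sts_true, sts_pred = [4.5, 1.0, 3.2, 2.8], [4.1, 1.4, 3.0, 2.5]  # toy similarity scores

print("accuracy:", accuracy_score(y_true, y_pred))   # RTE / WNLI
print("F1:", f1_score(y_true, y_pred))               # MRPC
print("Pearson:", pearsonr(sts_true, sts_pred)[0])   # STS-B
```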