jarodrigues committed · Commit 93e6739 · Parent(s): 94ea1a0
Update README.md

README.md CHANGED

@@ -25,15 +25,15 @@ widget:
---
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
<p style="text-align: center;"> This is the model card for Albertina PT-BR base.
You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina family</a>.
</p>

---

# Albertina PT-BR base

**Albertina PT-BR base** is a foundation large language model for American **Portuguese** from **Brazil**.

It is an **encoder** of the BERT family, based on the Transformer neural architecture and
developed over the DeBERTa model, with highly competitive performance for this language.

@@ -45,7 +45,7 @@ and to the best of our knowledge, these are encoders specifically for this langu
that, at the time of its initial distribution, set a new state of the art for it, and are made publicly available
and distributed for reuse.

**Albertina PT-BR base** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):

@@ -70,9 +70,9 @@ Please use the above cannonical reference when using or citing this model.

# Model Description

**This model card is for Albertina-PT-BR base**, with 100M parameters, 12 layers and a hidden size of 768.

Albertina-PT-BR base is distributed under an [MIT license](https://huggingface.co/PORTULAN/albertina-ptpt/blob/main/LICENSE).

DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBERTa/blob/master/LICENSE).
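
For a quick check of the encoder described above, a minimal usage sketch, assuming the checkpoint loads through the standard `transformers` fill-mask pipeline and that the tokenizer uses the usual `[MASK]` token; the example sentence is illustrative, not taken from the card.

```python
from transformers import pipeline

# Masked-language-modelling check of the encoder. The model id is the one this
# card links to; the Portuguese example sentence is our own illustration.
unmasker = pipeline("fill-mask", model="PORTULAN/albertina-ptbr-base")

for pred in unmasker("A culinária brasileira é rica em sabores e [MASK] regionais."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```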

@@ -82,7 +82,7 @@ DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBER

# Training Data

[**Albertina PT-BR base**](https://huggingface.co/PORTULAN/albertina-ptbr-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature. It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicates the Internet country code top-level domain of Brazil. We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
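
A sketch of the variant filtering just described, under assumptions rather than the authors' actual pipeline: the config name `"pt"` and the `meta`/`warc_headers`/`warc-target-uri` fields are assumptions about the OSCAR-2301 record layout on the Hub, and the corpus is gated, so authenticated access is assumed.

```python
from datasets import load_dataset

# Stream Portuguese OSCAR-2301 and keep only documents whose source URL falls
# under Brazil's country-code top-level domain (.br). Field names below are
# assumptions about the record layout, used purely for illustration.
oscar_pt = load_dataset("oscar-corpus/OSCAR-2301", "pt", split="train", streaming=True)

def from_brazil(example):
    url = example["meta"]["warc_headers"]["warc-target-uri"]
    host = url.split("/")[2] if "://" in url else url
    return host.lower().endswith(".br")

oscar_ptbr = oscar_pt.filter(from_brazil)
```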

@@ -95,10 +95,10 @@ We skipped the default filtering of stopwords since it would disrupt the syntact

## Training

As the codebase, we resorted to the [DeBERTa V1 base](https://huggingface.co/microsoft/deberta-base) for English.

To train [**Albertina PT-BR base**](https://huggingface.co/PORTULAN/albertina-ptbr-base), the data set was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding.
The model was trained using the maximum available memory capacity, resulting in a batch size of 3072 samples (192 samples per GPU).
We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
The model was trained for a total of 150 epochs, resulting in approximately 180k steps.
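
The authors trained with the DeBERTa codebase itself; purely to make the reported hyper-parameters concrete, the following sketch expresses an equivalent masked-language-modelling run with the Hugging Face `Trainer`. The training data is assumed to be the filtered OSCAR selection above, and the 15% masking ratio is a conventional default, not something stated on the card.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the DeBERTa V1 base checkpoint and its original tokenizer, as reported.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base")

def tokenize(batch):
    # 128-token sequence truncation; padding is applied dynamically by the collator.
    return tokenizer(batch["text"], truncation=True, max_length=128)

# `train_dataset` stands in for the tokenized PT-BR selection described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="albertina-ptbr-base-mlm",
    per_device_train_batch_size=192,  # 192 samples per GPU; 3072 overall implies 16 GPUs
    learning_rate=1e-5,               # linear decay is the Trainer's default schedule
    warmup_steps=10_000,
    num_train_epochs=150,             # ~180k optimisation steps in total, as reported
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset)
# trainer.train()
```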

@@ -124,7 +124,7 @@ We address four tasks from those in PLUE, namely:
|------------------------------|----------------|----------------|-----------|-----------------|
| **Albertina-PT-BR No-brWaC** | **0.7798**     | 0.5070         | **0.9167**| 0.8743          |
| **Albertina-PT-BR**          | 0.7545         | 0.4601         | 0.9071    | **0.8910**      |
| **Albertina-PT-BR base**     | 0.6462         | **0.5493**     | 0.8779    | 0.8501          |

<br>