jarodrigues committed on
Commit 25bc6ac
Parent: c260ec7

Update README.md

Files changed (1)
  1. README.md +8 -15
README.md CHANGED
@@ -34,20 +34,19 @@ widget:
  # Albertina PT-PT Base
 
 
- **Albertina PT-*** is a foundation, large language model for the **Portuguese language**.
+ **Albertina PT-PT Base** is a foundation, large language model for European **Portuguese** from **Portugal**.
 
  It is an **encoder** of the BERT family, based on the neural architecture Transformer and
  developed over the DeBERTa model, with most competitive performance for this language.
- It has different versions that were trained for different variants of Portuguese (PT),
- namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
- and it is distributed free of charge and under a most permissible license.
+ It is distributed free of charge and under a most permissible license.
 
- **Albertina PT-PT** is the version for European **Portuguese** from **Portugal**,
+ You may be also interested in [**Albertina PT-PT**](https://huggingface.co/PORTULAN/albertina-ptpt).
+ This is a larger version,
  and to the best of our knowledge, at the time of its initial distribution,
  it is the first competitive encoder specifically for this language and variant
  that is made publicly available and distributed for reuse.
 
- It is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
+ **Albertina PT-PT Base** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
  For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):
 
  ``` latex
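
The model described in the hunk above is a standard Hugging Face encoder checkpoint. As a minimal, hedged sketch of how such an encoder is typically queried for masked-token prediction with the `transformers` library: the repository id `PORTULAN/albertina-ptpt-base` and the `[MASK]` placeholder are assumptions inferred from the model names linked in this README, not details stated in the diff.

```python
# Minimal sketch, not from the commit: masked-token prediction with an
# Albertina encoder. The repository id "PORTULAN/albertina-ptpt-base" and the
# [MASK] placeholder are assumptions, not confirmed by the diff above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PORTULAN/albertina-ptpt-base")
for prediction in fill_mask("A capital de Portugal é [MASK]."):
    print(prediction["token_str"], prediction["score"])
```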
@@ -90,12 +89,9 @@ DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBER
  - [ParlamentoPT](https://huggingface.co/datasets/PORTULAN/parlamento-pt): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.
 
 
- [**Albertina PT-BR Base**](https://huggingface.co/PORTULAN/albertina-ptbr-base), in turn, was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set, specifically filtered by the Internet country code top-level domain of Brazil.
-
-
  ## Preprocessing
 
- We filtered the PT-PT and PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
+ We filtered the PT-PT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
  We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese.
 
 
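
The preprocessing text kept in the hunk above describes retaining the BLOOM document-level filters while leaving out stopword and language-identification filtering. Below is a plain-Python sketch of that decision, purely illustrative: it does not use the actual [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) API, and the length check stands in for whichever filters were actually kept.

```python
# Illustrative sketch only, not the BLOOM data-preparation API: stopword and
# language-identification filters are deliberately left out of the enabled set,
# as the README describes; the length check is a placeholder for kept filters.
from typing import Callable, List

def min_length_filter(doc: str) -> bool:
    return len(doc.split()) >= 5  # placeholder document-quality filter

# Not enabled: stopword filtering (would disrupt syntactic structure) and
# language identification (the corpus is already pre-selected as Portuguese).
ENABLED_FILTERS: List[Callable[[str], bool]] = [min_length_filter]

corpus = ["Os deputados debateram hoje o orçamento do Estado.", "ok"]
kept = [doc for doc in corpus if all(f(doc) for f in ENABLED_FILTERS)]
print(kept)  # the short document is dropped, the parliamentary sentence kept
```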
 
@@ -109,10 +105,6 @@ We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
  A total of 200 training epochs were performed resulting in approximately 180k steps.
  The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM.
 
- To train [**Albertina PT-BR Base**](https://huggingface.co/PORTULAN/albertina-ptbr-base) we followed the same hyperparameterization as the Albertina-PT-PT Base model.
- The model was trained with a total of 150 training epochs resulting in approximately 180k steps.
- The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM.
-
 
  <br>
 
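
The hyperparameters retained above (learning rate 1e-5, linear decay, 10k warm-up steps, 200 epochs) would look roughly as follows when expressed as `transformers.TrainingArguments`; the batch size, output directory, and the use of `TrainingArguments` itself are assumptions for illustration, not the authors' actual training script.

```python
# Hedged sketch: only the values marked "from the README" come from the diff;
# everything else is an assumption for illustration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="albertina-ptpt-base",   # hypothetical output path
    learning_rate=1e-5,                 # from the README
    lr_scheduler_type="linear",         # linear decay, from the README
    warmup_steps=10_000,                # 10k warm-up steps, from the README
    num_train_epochs=200,               # from the README
    per_device_train_batch_size=8,      # assumption, not stated in the diff
)
```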
 
@@ -129,7 +121,8 @@ We automatically translated the same four tasks from GLUE using [DeepL Translate
 
  | Model | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) |
  |--------------------------|----------------|----------------|-----------|-----------------|
- | **Albertina-PT-PT Base** | 0.6787 | 0.4507 | 0.8829 | 0.8581 |
+ | **Albertina-PT-PT** | **0.8339** | 0.4225 | **0.9171**| **0.8801** |
+ | **Albertina-PT-PT Base** | 0.6787 | **0.4507** | 0.8829 | 0.8581 |
 
  <br>
 
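
The columns in the table above use the standard GLUE metrics: accuracy for RTE and WNLI, F1 for MRPC, and Pearson correlation for STS-B. A minimal sketch of computing them on placeholder predictions (the numbers below are toy values, not the reported results):

```python
# Minimal sketch of the metrics in the table above (accuracy for RTE/WNLI,
# F1 for MRPC, Pearson correlation for STS-B), on placeholder predictions.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))   # RTE, WNLI
print("F1:", f1_score(y_true, y_pred))               # MRPC

gold, predicted = [4.5, 1.0, 3.2, 2.8], [4.1, 0.8, 3.5, 2.2]
pearson_r, _ = pearsonr(gold, predicted)
print("Pearson:", pearson_r)                         # STS-B
```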
 
 