Please use the above canonical reference when using or citing this model.

<br>

# Model Description

**This model card is for Albertina-PT-PT**, with 900M parameters, 24 layers and a hidden size of 1536.

This model is distributed free of charge under the [MIT](https://choosealicense.com/licenses/mit/) license (permits commercial use, distribution, modification and private use).
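
These dimensions match the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge) architecture used as the codebase (see Training below). As a minimal sketch, assuming the Hugging Face `transformers` library, an equivalent configuration can be inspected as follows (illustration only, not the official initialization):

```
# Illustrative sketch only: a DeBERTa-V2 configuration with the
# dimensions stated in this model card.
from transformers import DebertaV2Config, DebertaV2Model

config = DebertaV2Config(
    num_hidden_layers=24,  # 24 layers
    hidden_size=1536,      # hidden size of 1536
)
model = DebertaV2Model(config)
# The count lands around 900M parameters, varying with vocabulary size.
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```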

<br>

# Training Data

[...]

**Albertina PT-BR**, in turn, was trained over the [BrWac](https://huggingface.co/datasets/brwac) data set.

## Preprocessing

We filtered the PT-PT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline, resulting in a data set of 8 million documents containing around 2.2 billion tokens.
We skipped the default stopword filtering, since it would disrupt the syntactic structure, and likewise the language-identification filtering, given that the corpus had been pre-selected as Portuguese.
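
As a rough illustration of this kind of document-level filtering (a simplified sketch with made-up heuristics, not the actual BLOOM pipeline configuration):

```
# Simplified sketch of document-level corpus filtering; the actual BLOOM
# data-preparation pipeline applies a larger battery of quality heuristics.
def keep_document(text: str, min_words: int = 10, max_char_ratio: float = 0.2) -> bool:
    words = text.split()
    if len(words) < min_words:  # drop near-empty documents
        return False
    # Drop documents dominated by a single repeated character (spam/boilerplate).
    top_char_count = max(text.count(c) for c in set(text))
    if top_char_count / len(text) > max_char_ratio:
        return False
    # Deliberately no stopword filtering and no language-identification
    # filtering, mirroring the two default steps skipped above.
    return True

corpus = ["Um documento em português com texto suficiente para ser mantido aqui.", "aaaa aaaa"]
kept = [doc for doc in corpus if keep_document(doc)]  # keeps only the first
```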

## Training

As our codebase, we resorted to [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), a model for English.

[...]

In total, around 200k training steps were taken across 50 epochs.
The model was trained for 1 day and 11 hours on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1,360 GB of RAM.

<br>

# Evaluation

[...]

In the other group of data sets, we have the translations into PT-BR and PT-PT of the English data sets used for a few of the tasks in the widely used [GLUE benchmark](https://huggingface.co/datasets/glue), which allowed us to test both Albertina-PT-* variants on a wider variety of downstream tasks.

## ASSIN 2

[ASSIN 2](https://huggingface.co/datasets/assin2) is a **PT-BR** data set of approximately 10,000 sentence pairs, split into 6,500 for training, 500 for validation, and 2,448 for testing, annotated with semantic relatedness scores (ranging from 1 to 5) and with binary entailment judgments.
This data set supports the task of semantic textual similarity (STS), which consists of assigning a score for how semantically related two sentences are, and the task of recognizing textual entailment (RTE), which, given a pair of sentences, consists of determining whether the first entails the second.
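
A hedged sketch of how these two tasks can be scored (assuming the Hugging Face `datasets` library and the column names listed on the `assin2` dataset card, `relatedness_score` and `entailment_judgment`; the predictions below are placeholders for a fine-tuned model's outputs):

```
# Sketch: scoring STS (Pearson correlation) and RTE (accuracy) on ASSIN 2.
# Column names are assumed from the Hugging Face assin2 dataset card.
from datasets import load_dataset
from scipy.stats import pearsonr

test = load_dataset("assin2", split="test")
gold_sts = test["relatedness_score"]    # relatedness scores in [1, 5]
gold_rte = test["entailment_judgment"]  # binary entailment labels

pred_sts = gold_sts  # placeholder predictions
pred_rte = gold_rte  # placeholder predictions

print("STS Pearson r:", pearsonr(gold_sts, pred_sts)[0])
print("RTE accuracy:", sum(p == g for p, g in zip(pred_rte, gold_rte)) / len(gold_rte))
```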

[...]
| BERTimbau-large | 0.8913 | 0.8531 |

## GLUE tasks translated

We resort to [PLUE](https://huggingface.co/datasets/dlb/plue) (Portuguese Language Understanding Evaluation), a data set that was obtained by automatically translating GLUE into **PT-BR**.
We address four of the tasks in PLUE, namely (a data-loading sketch follows the results table below):

[...]

| | | | | |
| **Albertina-PT-BR** | 0.7942 | 0.4085 | 0.9048 | **0.8847** |
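
As referenced above, a minimal loading sketch for one PLUE task (the configuration name `rte` is an assumption about how the `dlb/plue` data set is organized):

```
# Hedged sketch: loading a single PLUE task; the config name "rte" is an
# assumption and may differ on the dlb/plue dataset.
from datasets import load_dataset

plue_rte = load_dataset("dlb/plue", "rte")
print(plue_rte["train"][0])  # a PT-BR premise/hypothesis pair
```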

<br>

# How to use

[...]

The model can be used by fine-tuning it for a specific task:

[...]
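
The card's own examples are elided above; as a stand-in, here is a hedged fine-tuning sketch on the ASSIN 2 RTE task (the model identifier below is an assumption based on this card's naming and may need adjusting):

```
# Hedged sketch of task-specific fine-tuning (ASSIN 2 RTE), not the card's
# own example. The model identifier is an assumption.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "PORTULAN/albertina-ptpt"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

data = load_dataset("assin2")
data = data.map(lambda b: tokenizer(b["premise"], b["hypothesis"], truncation=True),
                batched=True)
data = data.rename_column("entailment_judgment", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="albertina-ptpt-rte", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```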

<br>

# Citation

When using or citing this model, kindly cite the following publication:

```
[...]
}
```

<br>

# Acknowledgments

The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language,