jarodrigues committed on
Commit · 5bbdd69
1 Parent(s): 1c7bc3e
Update README.md
README.md
CHANGED
@@ -58,16 +58,16 @@ We skipped the default filtering of stopwords since it would disrupt the syntact
 As our codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), for English.
 
 To train **Albertina-PT-BR**, the BrWac data set was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding.
-The model was trained using the maximum available memory capacity
+The model was trained using the maximum available memory capacity, resulting in a batch size of 896 samples (56 samples per GPU, without gradient accumulation steps).
 We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps, based on the results of exploratory experiments.
 In total, around 200k training steps were taken across 50 epochs.
-
+The model was trained for 1 day and 11 hours on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1,360 GB of RAM.
 
 To train **Albertina-PT-PT**, the data set was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding.
-The model was trained using the maximum available memory capacity
+The model was trained using the maximum available memory capacity, resulting in a batch size of 832 samples (52 samples per GPU, applying gradient accumulation to approximate the batch size of the PT-BR model).
 Similarly to the PT-BR variant above, we opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 However, since the number of training examples is approximately twice that of the PT-BR variant, we halved the number of training epochs, completing only 25 epochs, which resulted in approximately 245k steps.
-
+The model was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM.
 
 # Evaluation
 
@@ -173,3 +173,6 @@ If Albertina proves useful for your work, we kindly ask that you cite the follow
 }
 ```
 
+# Acknowledgments
+
+TODO
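For reference, the tokenization setup described in the updated lines above (the original DeBERTa tokenizer, 128-token sequence truncation, dynamic padding) can be sketched with the Hugging Face `transformers` API roughly as follows. This is a minimal illustration, not the exact preprocessing pipeline used for BrWac or the PT-PT corpus; the `"text"` column name is an assumption.

```python
# Minimal sketch of the described preprocessing: truncate to 128 tokens at
# tokenization time, defer padding to the batch level (dynamic padding).
from transformers import AutoTokenizer, DataCollatorWithPadding

# Original DeBERTa V2 XLarge tokenizer, per the codebase referenced above
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

def tokenize(batch):
    # "text" is a placeholder column name; truncation caps every sequence at 128 tokens
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Dynamic padding: each batch is padded only up to its own longest sequence.
# The actual pre-training collator would additionally apply MLM masking.
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```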
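The schedule mentioned above (learning rate of 1e-5 with linear decay and 10k warm-up steps) is a standard linear-warm-up/linear-decay schedule. A sketch with `transformers` utilities, assuming an AdamW optimizer and the ~200k total steps reported for the PT-BR variant; the optimizer choice and the masked-LM head are assumptions, not confirmed by the README.

```python
# Sketch of the reported schedule: 1e-5 peak LR, linear warm-up for 10k steps,
# then linear decay over the remaining training steps.
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v2-xlarge")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # AdamW is an assumption

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,      # 10k warm-up steps, as stated
    num_training_steps=200_000,   # ~200k steps for Albertina-PT-BR
)
```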
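The effective batch sizes quoted above are internally consistent; the arithmetic below checks them. The gradient-accumulation factor of 2 for the PT-PT variant is inferred from 832 = 52 × 8 × 2 and is not stated explicitly, and the steps-per-epoch figures are only rough implications of the reported numbers.

```python
# Back-of-the-envelope check of the batch sizes quoted above.

# Albertina-PT-BR: 56 samples per GPU x 16 GPUs, no gradient accumulation
assert 56 * 16 == 896

# Albertina-PT-PT: 52 samples per GPU x 8 GPUs x 2 accumulation steps (inferred)
assert 52 * 8 * 2 == 832

# Implied steps per epoch (approximate):
#   PT-BR: ~200k steps / 50 epochs ≈ 4k steps/epoch  -> ~4k * 896  ≈ 3.6M sequences
#   PT-PT: ~245k steps / 25 epochs ≈ 9.8k steps/epoch -> ~9.8k * 832 ≈ 8.2M sequences
# which is consistent with the PT-PT set holding roughly twice as many training examples.
```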