jarodrigues committed on
Commit · 5bbdd69
1 Parent(s): 1c7bc3e
Update README.md
README.md
CHANGED
@@ -58,16 +58,16 @@ We skipped the default filtering of stopwords since it would disrupt the syntact
 As our codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), for English.
 
 To train **Albertina-PT-BR**, the BrWac data set was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding.
-The model was trained using the maximum available memory capacity
+The model was trained using the maximum available memory capacity, resulting in a batch size of 896 samples (56 samples per GPU, without gradient accumulation steps).
 We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps, based on the results of exploratory experiments.
 In total, around 200k training steps were taken across 50 epochs.
-
+The model was trained for 1 day and 11 hours on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1,360 GB of RAM.
 
 To train **Albertina-PT-PT**, the data set was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding.
-The model was trained using the maximum available memory capacity
+The model was trained using the maximum available memory capacity, resulting in a batch size of 832 samples (52 samples per GPU, applying gradient accumulation to approximate the batch size of the PT-BR model).
 Similarly to the PT-BR variant above, we opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 However, since the number of training examples is approximately twice that of the PT-BR variant, we halved the number of training epochs, completing only 25 epochs, which resulted in approximately 245k steps.
-
+The model was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM.
 
 # Evaluation
 
@@ -173,3 +173,6 @@ If Albertina proves useful for your work, we kindly ask that you cite the follow
 }
 ```
 
+# Acknowledgments
+
+TODO
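For reference, the tokenization setup described in the updated lines above (the original DeBERTa tokenizer, 128-token sequence truncation, dynamic padding) can be sketched with the Hugging Face `transformers` API roughly as follows. This is a minimal illustration, not the exact preprocessing pipeline used for BrWac or the PT-PT corpus; the `"text"` column name is an assumption.

```python
# Minimal sketch of the described preprocessing: truncate to 128 tokens at
# tokenization time, defer padding to the batch level (dynamic padding).
from transformers import AutoTokenizer, DataCollatorWithPadding

# Original DeBERTa V2 XLarge tokenizer, per the codebase referenced above
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

def tokenize(batch):
    # "text" is a placeholder column name; truncation caps every sequence at 128 tokens
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Dynamic padding: each batch is padded only up to its own longest sequence.
# The actual pre-training collator would additionally apply MLM masking.
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```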
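The schedule mentioned above (learning rate of 1e-5 with linear decay and 10k warm-up steps) is a standard linear-warm-up/linear-decay schedule. A sketch with `transformers` utilities, assuming an AdamW optimizer and the ~200k total steps reported for the PT-BR variant; the optimizer choice and the masked-LM head are assumptions, not confirmed by the README.

```python
# Sketch of the reported schedule: 1e-5 peak LR, linear warm-up for 10k steps,
# then linear decay over the remaining training steps.
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v2-xlarge")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # AdamW is an assumption

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,      # 10k warm-up steps, as stated
    num_training_steps=200_000,   # ~200k steps for Albertina-PT-BR
)
```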
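The effective batch sizes quoted above are internally consistent; the arithmetic below checks them. The gradient-accumulation factor of 2 for the PT-PT variant is inferred from 832 = 52 × 8 × 2 and is not stated explicitly, and the steps-per-epoch figures are only rough implications of the reported numbers.

```python
# Back-of-the-envelope check of the batch sizes quoted above.

# Albertina-PT-BR: 56 samples per GPU x 16 GPUs, no gradient accumulation
assert 56 * 16 == 896

# Albertina-PT-PT: 52 samples per GPU x 8 GPUs x 2 accumulation steps (inferred)
assert 52 * 8 * 2 == 832

# Implied steps per epoch (approximate):
#   PT-BR: ~200k steps / 50 epochs ≈ 4k steps/epoch  -> ~4k * 896  ≈ 3.6M sequences
#   PT-PT: ~245k steps / 25 epochs ≈ 9.8k steps/epoch -> ~9.8k * 832 ≈ 8.2M sequences
# which is consistent with the PT-PT set holding roughly twice as many training examples.
```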