jarodrigues committed
Commit 450aa46 · 1 Parent(s): 8716635
Update README.md
README.md CHANGED
@@ -14,8 +14,6 @@ tags:
 - bertimbau
 license: mit
 datasets:
-- oscar
-- brwac
 - europarl_bilingual
 - PORTULAN/glue-ptpt
 - PORTULAN/parlamento-pt

@@ -92,7 +90,7 @@ This model is distributed free of charge under the [MIT](https://choosealicense.
 - [ParlamentoPT](https://www.parlamento.pt/): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.


-**Albertina PT-BR
+[**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), in turn, was trained over the [BrWac](https://huggingface.co/datasets/brwac) data set.


 ## Preprocessing

@@ -111,7 +109,7 @@ Similarly to the PT-BR variant above, we opted for a learning rate of 1e-5 with
 However, since the number of training examples is approximately twice of that in the PT-BR variant, we reduced the number of training epochs to half and completed only 25 epochs, which resulted in approximately 245k steps.
 The model was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM.

-To train **Albertina
+To train [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr) the BrWac data set was tokenized with the original DeBERTA tokenizer with a 128 token sequence truncation and dynamic padding.
 The model was trained using the maximum available memory capacity resulting in a batch size of 896 samples (56 samples per GPU without gradient accumulation steps).
 We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps based on the results of exploratory experiments.
 In total, around 200k training steps were taken across 50 epochs.
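
The line added in the last hunk describes how the BrWac corpus was prepared for the PT-BR variant: sequences truncated at 128 tokens, with padding applied dynamically per batch. The sketch below illustrates that kind of preprocessing with the Hugging Face `transformers` API; it is not the released training code, and the checkpoint name, the `text` column, and the masking probability are assumptions made for illustration only.

```python
# Illustrative sketch only (not the released Albertina training code).
# Assumptions: a DeBERTa V2 checkpoint name, a "text" column, and a 15% MLM
# masking probability, none of which are stated in the diff above.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

def tokenize(batch):
    # Truncate each example to 128 tokens; no padding is applied here, so the
    # collator can pad each batch only to its longest sequence (dynamic padding).
    return tokenizer(batch["text"], truncation=True, max_length=128)

# The collator pads dynamically and builds masked-language-modelling labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```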
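The hyper-parameters quoted for the PT-BR variant (learning rate of 1e-5 with linear decay, 10k warm-up steps, 56 samples per GPU without gradient accumulation, 50 epochs) map onto standard `transformers` training arguments roughly as follows. The output directory is hypothetical, and whether the authors used the `Trainer` API at all is an assumption of this sketch.

```python
# Rough mapping of the quoted hyper-parameters onto TrainingArguments.
# The output directory is a hypothetical name for illustration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="albertina-ptbr-pretraining",  # hypothetical path
    per_device_train_batch_size=56,           # 56 samples per GPU, no gradient accumulation
    learning_rate=1e-5,                       # as stated in the README
    lr_scheduler_type="linear",               # linear decay
    warmup_steps=10_000,                      # 10k warm-up steps
    num_train_epochs=50,                      # ~200k steps across 50 epochs for PT-BR
)
```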