jarodrigues committed on
Commit 450aa46 · 1 Parent(s): 8716635

Update README.md

Files changed (1)
  1. README.md +2 -4
README.md CHANGED
@@ -14,8 +14,6 @@ tags:
  - bertimbau
  license: mit
  datasets:
- - oscar
- - brwac
  - europarl_bilingual
  - PORTULAN/glue-ptpt
  - PORTULAN/parlamento-pt
@@ -92,7 +90,7 @@ This model is distributed free of charge under the [MIT](https://choosealicense.
  - [ParlamentoPT](https://www.parlamento.pt/): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.


- **Albertina PT-BR**, in turn, was trained over the [BrWac](https://huggingface.co/datasets/brwac) data set.
+ [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), in turn, was trained over the [BrWac](https://huggingface.co/datasets/brwac) data set.


  ## Preprocessing
@@ -111,7 +109,7 @@ Similarly to the PT-BR variant above, we opted for a learning rate of 1e-5 with
  However, since the number of training examples is approximately twice of that in the PT-BR variant, we reduced the number of training epochs to half and completed only 25 epochs, which resulted in approximately 245k steps.
  The model was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM.

- To train **Albertina-PT-BR** the BrWac data set was tokenized with the original DeBERTA tokenizer with a 128 token sequence truncation and dynamic padding.
+ To train [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr) the BrWac data set was tokenized with the original DeBERTA tokenizer with a 128 token sequence truncation and dynamic padding.
  The model was trained using the maximum available memory capacity resulting in a batch size of 896 samples (56 samples per GPU without gradient accumulation steps).
  We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps based on the results of exploratory experiments.
  In total, around 200k training steps were taken across 50 epochs.
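
The context lines of the last hunk quote the PT-BR training figures: a batch of 896 samples at 56 per GPU, a learning rate of 1e-5 with linear decay, 10k warm-up steps, and around 200k steps over 50 epochs. The sketch below is not the authors' training code. It only restates those numbers in plain Python to show what they would imply. The shape of the schedule (linear warm-up from zero, then linear decay to zero) is an assumption, as is treating "around 200k" as exactly 200,000 steps. All function and constant names are hypothetical.

```python
# Figures quoted in the README text shown in the diff above.
PEAK_LR = 1e-5          # "learning rate of 1e-5"
WARMUP_STEPS = 10_000   # "10k warm-up steps"
TOTAL_STEPS = 200_000   # "around 200k"; treated as exact here (assumption)
BATCH_SIZE = 896        # total samples per step
PER_GPU_BATCH = 56      # samples per GPU, no gradient accumulation
EPOCHS = 50


def linear_schedule_lr(step: int) -> float:
    """Learning rate at an optimizer step.

    Assumed shape: linear warm-up from 0 to PEAK_LR, then linear
    decay back to 0 at TOTAL_STEPS. The README only says "linear decay"
    and "warm-up steps", so the details are a guess.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


def implied_device_count() -> int:
    """Devices implied by the batch figures when there is no accumulation."""
    return BATCH_SIZE // PER_GPU_BATCH


def approx_examples_per_epoch() -> int:
    """Rough examples per epoch: steps per epoch times batch size."""
    return TOTAL_STEPS // EPOCHS * BATCH_SIZE
```

Under these assumptions, `linear_schedule_lr(5_000)` is about half the peak rate, and `implied_device_count()` is a derived figure only, since the PT-BR hardware is not stated in the lines shown here.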