asier-gutierrez committed on
Commit
1a160e2
1 Parent(s): 721825b

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -33,7 +33,7 @@ Some of the statistics of the corpus:
  | BNE | 201,080,084 | 135,733,450,668 | 570GB |

  ## Tokenization and pre-training
- We trained a BBPE tokenizer with a size of 50,262 tokens. We used 10,000 documents for validation and we trained the model for 48 hours into 16 computing nodes with 4 Nvidia V100 GPUs per node.
+ The training corpus has been tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model, with a vocabulary size of 50,262 tokens. The RoBERTa-base-bne pre-training consists of masked language model training following the approach employed for the RoBERTa base model, with the same hyperparameters as in the original work. Training lasted a total of 48 hours on 16 computing nodes, each with 4 NVIDIA V100 GPUs of 16GB VRAM.

  ## Evaluation and results
  For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
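
As context for the paragraph added in this commit: a byte-level BPE (BBPE) tokenizer of the kind it describes can be trained with the Hugging Face `tokenizers` library. The sketch below is illustrative only, not the authors' actual script; the vocabulary size of 50,262 and the RoBERTa-style special tokens follow the README, while the corpus file path and frequency cutoff are hypothetical.

```python
# Minimal sketch of training a byte-level BPE (BBPE) tokenizer like the one
# described in the README, using the Hugging Face `tokenizers` library.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["bne_corpus.txt"],  # hypothetical path to the cleaned BNE corpus
    vocab_size=50_262,         # vocabulary size stated in the README
    min_frequency=2,           # illustrative cutoff, not given in the source
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa specials
)

# Writes vocab.json and merges.txt for later use in RoBERTa pre-training.
tokenizer.save_model(".")
```

The masked-language-model pre-training the paragraph mentions corresponds, in `transformers` terms, to something like the following; the configuration uses base-size defaults and is an assumption, not the authors' exact setup.

```python
# Sketch of the masked-LM setup described in the README: a RoBERTa-base
# architecture sized for the BBPE vocabulary above.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(vocab_size=50_262)  # base-size defaults otherwise
model = RobertaForMaskedLM(config)
```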