Commit 013d9ea (parent: b5faea8): Update README.md

README.md (changed):
@@ -34,7 +34,7 @@ Some of the statistics of the corpus:
 | BNE | 201,080,084 | 135,733,450,668 | 570GB |
 
 ## Tokenization and pre-training
 
-The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE) used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model with a vocabulary size of 50,262 tokens. The RoBERTa-large-bne pre-training consists of a masked language model training that follows the approach employed for RoBERTa large.
+The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE) used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model with a vocabulary size of 50,262 tokens. The RoBERTa-large-bne pre-training consists of a masked language model training that follows the approach employed for RoBERTa large. The training lasted a total of 96 hours on 32 computing nodes, each with 4 NVIDIA V100 GPUs of 16GB VRAM.
 
 ## Evaluation and results
 
 For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
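As a quick illustration of the byte-level BPE tokenizer and the masked-language-model objective described in the updated paragraph, the minimal sketch below loads the checkpoint through the Hugging Face transformers library. The hub identifier `PlanTL-GOB-ES/roberta-large-bne` is an assumption based on the project name, not something stated in this commit; check the model card for the exact ID.

```python
# Minimal sketch (not part of the commit): inspecting the byte-level BPE tokenizer
# and querying the masked-language-model head with the transformers library.
# The hub ID below is an assumption; check the model card for the exact identifier.
from transformers import AutoTokenizer, pipeline

model_id = "PlanTL-GOB-ES/roberta-large-bne"  # assumed hub ID

# The tokenizer is a byte-level BPE with the ~50k vocabulary mentioned in the README.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer.vocab_size)  # expected to report roughly the 50,262 tokens cited above
print(tokenizer.tokenize("El corpus procede de la Biblioteca Nacional de España."))

# Because pre-training used a RoBERTa-style masked-language-model objective,
# the checkpoint can be queried directly with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model_id)
print(fill_mask("Madrid es la <mask> de España."))
```

The `<mask>` placeholder follows RoBERTa's convention, and the fill-mask pipeline returns the top-scoring completions with their scores.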