asier-gutierrez committed
Commit e2f38ec
1 Parent(s): 82fd9d5

Update README.md

Files changed (1)
README.md +16 -9
README.md CHANGED
@@ -20,10 +20,22 @@ widget:
 
 # RoBERTa base trained with data from National Library of Spain (BNE)
 
- ## Introduction
- This work presents the Spanish RoBERTa-base model. The model has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain from 2009 to 2019.
+ ## Model Description
+ RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa]() base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the National Library of Spain from 2009 to 2019.
 
- ## Evaluation
+ ## Training corpora and preprocessing
+ We cleaned 59TB of WARC files and deduplicated them at the computing-node level, which resulted in 2TB of clean Spanish text. We then performed a global deduplication, leaving 570GB of text.
+
+ Some statistics of the corpus:
+
+ | Corpora | Number of documents | Number of tokens | Size (GB) |
+ |---------|---------------------|------------------|-----------|
+ | BNE | 201,080,084 | 135,733,450,668 | 570 |
+
+ ## Tokenization and pre-training
+ We trained a byte-level BPE (BBPE) tokenizer with a vocabulary size of 50,262 tokens. We used 10,000 documents for validation and trained the model for 48 hours on 16 computing nodes with 4 NVIDIA V100 GPUs per node.
+
+ ## Evaluation and results
 For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
 
 ## Citing
@@ -38,9 +50,4 @@ Check out our paper for all the details: https://arxiv.org/abs/2107.07253
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
- ```
-
- ## Corpora
- | Corpora | Number of documents | Size (GB) |
- |---------|---------------------|-----------|
- | BNE | 201,080,084 | 570GB |
+ ```
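
The new "Model Description" section above presents RoBERTa-base-bne as a masked language model. As a quick illustration of what that means in practice, here is a minimal sketch of querying such a model through the Hugging Face `transformers` fill-mask pipeline; the repository id used below is an assumption for illustration and is not taken from this card.

```python
# Minimal sketch: querying a Spanish masked language model with the
# `transformers` fill-mask pipeline. The model id is an assumption;
# substitute the actual Hub id of this model card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```

Each prediction is a candidate token for the `<mask>` position together with its probability score.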
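
The "Training corpora and preprocessing" section describes two deduplication passes: one per computing node and one global. The sketch below only illustrates that two-stage idea with exact, hash-based duplicate removal; it is not the authors' actual cleaning pipeline, and the `normalize` and `dedupe` helpers are hypothetical.

```python
# Illustrative two-stage exact deduplication: per-shard ("node-level") passes
# followed by a global pass with a shared hash set.
import hashlib
from typing import Iterable, List

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies collide.
    return " ".join(text.lower().split())

def dedupe(docs: Iterable[str], seen: set) -> List[str]:
    kept = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# "Node-level" pass: each shard is deduplicated with its own hash set.
shards = [["Hola mundo.", "Hola  mundo."], ["Hola mundo.", "Otro documento."]]
node_deduped = [dedupe(shard, set()) for shard in shards]

# "Global" pass: a single hash set shared across all shards.
global_seen: set = set()
corpus = [doc for shard in node_deduped for doc in dedupe(shard, global_seen)]
print(corpus)  # ['Hola mundo.', 'Otro documento.']
```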
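
The "Tokenization and pre-training" section mentions a BBPE tokenizer with a 50,262-token vocabulary. A minimal sketch of training such a tokenizer with the Hugging Face `tokenizers` library follows; the corpus file names, output directory, and special tokens are assumptions, not the authors' exact configuration.

```python
# Minimal sketch: training a byte-level BPE tokenizer with the `tokenizers`
# library, using the vocabulary size stated in the README.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["bne_corpus_part1.txt", "bne_corpus_part2.txt"],  # hypothetical corpus shards
    vocab_size=50_262,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("roberta-base-bne-tokenizer", exist_ok=True)
tokenizer.save_model("roberta-base-bne-tokenizer")  # writes vocab.json and merges.txt
```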