crodri committed
Commit 754346f · 1 Parent(s): 6800e66

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -49,7 +49,7 @@ widget:
   </details>
 
   ## Model description
 - The **longformer-base-4096-bne-es** is the [Longformer](https://huggingface.co/allenai/longformer-base-4096) version of the [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) masked language model for the Spanish language. Using this kind of model allows us to process larger contexts as input without needing additional aggregation strategies. The model started from the **roberta-base-bne** checkpoint and was pretrained for MLM on long documents from the National Library of Spain.
 + The **longformer-base-4096-bne-es** is the [Longformer](https://huggingface.co/allenai/longformer-base-4096) version of the [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) masked language model for the Spanish language. Using these kinds of models allows us to process larger contexts as input without needing additional aggregation strategies. The model started from the **roberta-base-bne** checkpoint and was pretrained for MLM on long documents from the National Library of Spain.
 
   The Longformer model uses a combination of sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations. Please refer to the original [paper](https://arxiv.org/abs/2004.05150) for more details on how to set global attention.
 
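As a concrete illustration of the global-attention setup mentioned in the paragraph above, here is a minimal sketch using the `transformers` library. The repository id `PlanTL-GOB-ES/longformer-base-4096-bne-es` is an assumption (the commit only names the model and the PlanTL-GOB-ES organization), and giving global attention to the first token is just one task-dependent choice.

```python
# Hedged sketch: the repository id below is an assumption, not stated in this commit.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "PlanTL-GOB-ES/longformer-base-4096-bne-es"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Hoy hace un muy buen día en <mask>."
inputs = tokenizer(text, return_tensors="pt")

# Sliding-window (local) attention is used everywhere by default;
# global attention is switched on per token. Giving the first token
# (<s>) global attention is one common, task-dependent choice.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    logits = model(**inputs, global_attention_mask=global_attention_mask).logits

# Decode the top prediction for the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode([predicted_id]))
```

As a rule of thumb from the Longformer paper, sequence-level tasks typically place global attention on the leading `<s>` token, while QA-style tasks place it on the question tokens.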
 
@@ -94,7 +94,7 @@ Some of the statistics of the corpus:
   |---------|---------------------|------------------|-----------|
   | BNE | 201,080,084 | 135,733,450,668 | 570GB |
 
 - For this Longformer, we have used a small random partition of 7.2GB containing documents with fewer than 4096 tokens as a training split.
 + For this Longformer, we used a small random partition of 7.2GB containing documents with fewer than 4096 tokens as a training split.
 
   ### Tokenization and pre-training
   The training corpus has been tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model, with a vocabulary size of 50,262 tokens. The RoBERTa-base-bne pre-training consists of masked language model training that follows the approach employed for RoBERTa base. Training lasted a total of 40 hours on 8 compute nodes, each with 2 AMD MI50 GPUs of 32GB VRAM.
 
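The vocabulary size quoted above (50,262 byte-level BPE tokens) can be sanity-checked once the tokenizer is loaded; a minimal sketch, again assuming the same repository id:

```python
# Minimal sketch, assuming the tokenizer ships with the (assumed) repo id
# "PlanTL-GOB-ES/longformer-base-4096-bne-es".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/longformer-base-4096-bne-es")

# The README above states a byte-level BPE vocabulary of 50,262 tokens.
print(len(tokenizer))  # expected to be in the region of 50,262
print(tokenizer.tokenize("La Biblioteca Nacional de España conserva documentos muy largos."))
```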
 
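The training split in the second hunk is described as documents of fewer than 4096 tokens. A hedged sketch of such a length filter is shown below; the corpus list is a placeholder and the helper name `fits_context` is hypothetical, with only the under-4096-token rule coming from the README.

```python
# Hedged sketch of the length filter described above: keep only documents
# that tokenize to fewer than 4096 tokens. The repository id and the
# placeholder corpus are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "PlanTL-GOB-ES/longformer-base-4096-bne-es"  # assumed repo id
)

MAX_TOKENS = 4096

def fits_context(text: str) -> bool:
    """True if the tokenized document fits in the Longformer context window."""
    return len(tokenizer(text, add_special_tokens=True)["input_ids"]) < MAX_TOKENS

documents = [
    "Un documento de ejemplo bastante corto.",
    "Otro documento de ejemplo, algo más largo que el anterior.",
]  # placeholder; the real split was randomly sampled from the BNE corpus
training_split = [doc for doc in documents if fits_context(doc)]
print(f"{len(training_split)} of {len(documents)} documents kept")
```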