Update README.md
README.md CHANGED
@@ -41,6 +41,7 @@ widget:
- [Training procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
+ - [Author](#author)
- [Contact information](#contact-information)
- [Copyright](#copyright)
- [Licensing information](#licensing-information)
@@ -124,10 +125,13 @@ Some of the statistics of the corpus:

### Training procedure
The configuration of the **RoBERTa-large-bne** model is as follows:
+
- RoBERTa-l: 24-layer, 1024-hidden, 16-heads, 355M parameters.

The pretraining objective used for this architecture is masked language modeling without next sentence prediction.
+
The training corpus has been tokenized using a byte-level version of Byte-Pair Encoding (BPE), as used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model, with a vocabulary size of 50,262 tokens.
+
The RoBERTa-large-bne pre-training consists of masked language model training that follows the approach employed for RoBERTa base. The training lasted a total of 96 hours on 32 computing nodes, each with 4 NVIDIA V100 GPUs of 16 GB VRAM.

## Evaluation
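The architecture bullet in the hunk above (24 layers, 1024 hidden size, 16 attention heads, 355M parameters) can be compared against the published configuration. A minimal sketch with Hugging Face Transformers, assuming the model is hosted on the Hub as `PlanTL-GOB-ES/roberta-large-bne` (the repository id is not stated in this diff and is an assumption here):

```python
# Minimal sketch: inspect the RoBERTa-large-bne configuration and compare it
# with the figures quoted in the model card. The Hub id is an assumption.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")  # assumed repository id

print(config.num_hidden_layers)    # card says 24-layer
print(config.hidden_size)          # card says 1024-hidden
print(config.num_attention_heads)  # card says 16-heads
```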
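Similarly, the byte-level BPE vocabulary of 50,262 tokens mentioned above can be checked by loading the tokenizer. The Hub id is the same assumption as before, and the sample sentence is purely illustrative:

```python
# Minimal sketch: load the byte-level BPE tokenizer and confirm the vocabulary
# size quoted in the model card. The Hub id is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")  # assumed repository id

print(tokenizer.vocab_size)  # card quotes 50,262 tokens
print(tokenizer.tokenize("Este modelo fue preentrenado con un corpus en español."))
```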
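Finally, since the pretraining objective is masked language modeling with no next-sentence prediction, the model can be exercised directly through the fill-mask pipeline. Again a sketch under the same assumed Hub id; the example sentence and its predictions are illustrative only:

```python
# Minimal sketch: query the masked-language-modeling head via the fill-mask
# pipeline. RoBERTa-style tokenizers use "<mask>" as the mask token.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-large-bne")  # assumed repository id

for prediction in unmasker("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 3))
```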