mmarimon committed
Commit 369efc7 (parent: c1ee3e0)

Update README.md

Files changed (1): README.md (+4, -0)
README.md CHANGED
@@ -41,6 +41,7 @@ widget:
 - [Training procedure](#training-procedure)
 - [Evaluation](#evaluation)
 - [Additional information](#additional-information)
+ - [Author](#author)
 - [Contact information](#contact-information)
 - [Copyright](#copyright)
 - [Licensing information](#licensing-information)
@@ -124,10 +125,13 @@ Some of the statistics of the corpus:

 ### Training procedure
 The configuration of the **RoBERTa-large-bne** model is as follows:
+
 - RoBERTa-l: 24-layer, 1024-hidden, 16-heads, 355M parameters.

 The pretraining objective used for this architecture is masked language modeling without next sentence prediction.
+
 The training corpus was tokenized using the byte-level version of Byte-Pair Encoding (BPE) used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model, with a vocabulary size of 50,262 tokens.
+
 The RoBERTa-large-bne pre-training consists of masked language model training that follows the approach employed for RoBERTa base. The training lasted a total of 96 hours on 32 computing nodes, each with 4 NVIDIA V100 GPUs with 16 GB of VRAM.

 ## Evaluation
 
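The training-procedure paragraphs in the diff above describe the model configuration, the byte-level BPE tokenizer, and the masked-language-modeling objective. A minimal sketch of how a reader might inspect these with the `transformers` library follows; the Hub model ID is an assumption (it is not stated in this commit), and the snippet is illustrative rather than part of the model card.

```python
# Minimal sketch (not part of the commit): inspect the architecture, tokenizer,
# and MLM objective described in the updated card.
# NOTE: the Hub model ID below is an assumption; it does not appear in this diff.
from transformers import AutoConfig, AutoTokenizer, pipeline

model_id = "PlanTL-GOB-ES/roberta-large-bne"  # assumed model ID

# Card states: 24-layer, 1024-hidden, 16-heads.
config = AutoConfig.from_pretrained(model_id)
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

# Byte-level BPE tokenizer; the card states a vocabulary of 50,262 tokens.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print("vocab size:", tokenizer.vocab_size)

# Masked language modeling without next sentence prediction: the model only
# has to recover masked tokens, which the fill-mask pipeline exercises.
fill_mask = pipeline("fill-mask", model=model_id, tokenizer=tokenizer)
for pred in fill_mask(f"Madrid es la {tokenizer.mask_token} de España."):
    print(pred["token_str"], round(pred["score"], 3))
```

The fill-mask pipeline exercises exactly the masked-token prediction objective the card describes; since pretraining used no next-sentence-prediction task, there is no NSP head to query.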