rskuzma committed
Commit 5bca29e
1 Parent(s): d777cc2

fixed typo

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -108,7 +108,7 @@ Recent works find significant duplicate data present in the Pile. Eleuther’s P
 
  We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
 
- All models were trained to Chinchilla point: 20x more tokens than model parameters. Number of steps changed based on fixed batch size (2048) and sequence length (varied by model). See Training Table, below, for detail.
+ All models were trained to Chinchilla point: 20 tokens per model parameter. Number of steps was chosen based on optimal batch size (varied by model) and fixed sequence length (2048). See Training Table, below, for details.
 
  <br>
 
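
For readers following the corrected sentence, here is a minimal sketch of the step-count arithmetic it implies (20 tokens per model parameter at a fixed sequence length of 2048). The parameter count and batch size used below are illustrative assumptions only, not values taken from the model card's Training Table.

```python
# Minimal sketch of the Chinchilla step-count arithmetic described in the diff.
# Assumptions: the 111M parameter count and the batch size of 120 are
# hypothetical; actual per-model batch sizes are listed in the Training Table.

def chinchilla_steps(n_params: int, batch_size: int, seq_len: int = 2048) -> int:
    """Optimizer steps needed to reach 20 tokens per model parameter."""
    target_tokens = 20 * n_params            # Chinchilla-style token budget
    tokens_per_step = batch_size * seq_len   # tokens consumed per optimizer step
    return -(-target_tokens // tokens_per_step)  # ceiling division


if __name__ == "__main__":
    # Hypothetical 111M-parameter model with an assumed batch size of 120
    print(chinchilla_steps(111_000_000, batch_size=120))  # -> 9034 steps
```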