fixed typo
README.md
CHANGED
@@ -108,7 +108,7 @@ Recent works find significant duplicate data present in the Pile. Eleuther’s P
 
 We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
 
-All models were trained to Chinchilla point:
+All models were trained to Chinchilla point: 20 tokens per model parameter. Number of steps was chosen based on optimal batch size (varied by model) and fixed sequence length (2048). See Training Table, below, for details.
 
 <br>
 
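
The added README sentence implies a simple relation between parameter count, batch size, and step count. A minimal sketch of that arithmetic follows, assuming batch size is counted in sequences and the step count is rounded up; the function name `chinchilla_steps` and the batch size of 120 in the usage line are hypothetical and not taken from the README.

```python
TOKENS_PER_PARAM = 20  # "Chinchilla point" as stated in the README
SEQ_LEN = 2048         # fixed sequence length (MSL) stated in the README


def chinchilla_steps(n_params: int, batch_size: int, seq_len: int = SEQ_LEN) -> int:
    """Number of optimizer steps needed to see 20 tokens per parameter.

    batch_size is assumed to be measured in sequences per step.
    """
    total_tokens = TOKENS_PER_PARAM * n_params
    tokens_per_step = batch_size * seq_len
    # Integer ceiling division: round up so the token budget is fully covered.
    return -(-total_tokens // tokens_per_step)


# Hypothetical usage: a 111M-parameter model with an assumed batch size of 120.
steps = chinchilla_steps(111_000_000, 120)
```

The actual per-model batch sizes and step counts are the ones listed in the README's Training Table; this sketch only illustrates how the three quantities relate.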