Update README.md
Browse files
README.md
CHANGED
@@ -5,7 +5,7 @@ Architext GPT-J-162M is a transformer model trained using Ben Wang's Mesh Transf
|
|
5 |
|
6 |
The model consists of 12 layers with a model dimension of 768, and a feedforward dimension of 2048. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
|
7 |
|
8 |
-
#Training data
|
9 |
GPT-J 162B was pre-trained on the Pile, a large-scale curated dataset created by EleutherAI. It was then finetuned on synthetically generated data that was procedurally generated using the Rhinocers/Grasshopper software suite. The model was finetuned for 1.25 billion tokens over 11,500 steps on TPU v3-8. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
|
10 |
|
11 |
#Intended Use and Limitations
|
|
|
5 |
|
6 |
The model consists of 12 layers with a model dimension of 768, and a feedforward dimension of 2048. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
|
7 |
|
8 |
+
# Training data
|
9 |
GPT-J 162B was pre-trained on the Pile, a large-scale curated dataset created by EleutherAI. It was then finetuned on synthetically generated data that was procedurally generated using the Rhinocers/Grasshopper software suite. The model was finetuned for 1.25 billion tokens over 11,500 steps on TPU v3-8. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
|
10 |
|
11 |
#Intended Use and Limitations
|