add training procedure
- .ipynb_checkpoints/README-checkpoint.md +5 -0
- README.md +5 -0
.ipynb_checkpoints/README-checkpoint.md
CHANGED
@@ -36,3 +36,8 @@ Since it is not based on [transformers](https://github.com/huggingface/transformers)
 model.net.to_logits[1].weight.requires_grad_(False)
 model.net.to_logits[1].weight.copy_(emb)
 ```
+
+
+## Training procedure
+
+Primarily, it has been trained with a language modeling objective; however, I added a few tricks to further optimize training. The main trick of using pretrained embeddings is explained in the Devlog blog post linked above. The batch size is 8 with a sequence length of 128, and the optimizer is AdamW with a learning rate of 2e-5; since training uses gradient accumulation over 4 steps, the effective batch size is 32. Training uses two types of losses: a simple cross-entropy loss for next-token prediction and a distillation loss from GPT2-xl, and the two losses are alternated during training. Gradient norms are also clipped at 1.0.
README.md
CHANGED
@@ -36,3 +36,8 @@ Since it is not based on [transformers](https://github.com/huggingface/transformers)
 model.net.to_logits[1].weight.requires_grad_(False)
 model.net.to_logits[1].weight.copy_(emb)
 ```
+
+
+## Training procedure
+
+Primarily, it has been trained with a language modeling objective; however, I added a few tricks to further optimize training. The main trick of using pretrained embeddings is explained in the Devlog blog post linked above. The batch size is 8 with a sequence length of 128, and the optimizer is AdamW with a learning rate of 2e-5; since training uses gradient accumulation over 4 steps, the effective batch size is 32. Training uses two types of losses: a simple cross-entropy loss for next-token prediction and a distillation loss from GPT2-xl, and the two losses are alternated during training. Gradient norms are also clipped at 1.0.
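For illustration, here is a minimal PyTorch sketch of the alternating-loss loop described in the added section. The `model`, `teacher` (GPT2-xl), and `loader` objects, the step-wise alternation schedule, and the KL-based distillation loss with its temperature are assumptions made for the sketch; only the hyperparameters (batch size 8, sequence length 128, AdamW at 2e-5, gradient accumulation of 4 for an effective batch size of 32, gradient clipping at 1.0) come from the section itself.

```python
import torch
import torch.nn.functional as F

def train(model, teacher, loader,
          lr=2e-5, accum_steps=4, max_grad_norm=1.0, temperature=2.0):
    """Alternate a next-token cross-entropy loss with a distillation loss.

    `model`, `teacher` (GPT2-xl), and `loader` are placeholders: `loader` is
    assumed to yield (inputs, targets) token batches of shape (8, 128), and
    both models are assumed to return logits of shape (batch, seq, vocab).
    The distillation formulation and the temperature are assumptions; the
    hyperparameters come from the README section above.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()

    for step, (inputs, targets) in enumerate(loader):
        logits = model(inputs)

        if step % 2 == 0:
            # Simple cross entropy for next-token prediction.
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            )
        else:
            # Distillation: KL divergence against the softened teacher distribution.
            with torch.no_grad():
                teacher_logits = teacher(inputs)
            loss = F.kl_div(
                F.log_softmax(logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2

        # Gradient accumulation over 4 steps -> effective batch size 32.
        (loss / accum_steps).backward()

        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            optimizer.zero_grad()
```

Dividing each loss by the accumulation count keeps the accumulated gradient comparable in scale to a true batch of 32 before the norm is clipped and the optimizer steps.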