naxalpha committed on
Commit
7a42e6d
1 Parent(s): 9c22e75

add training procedure

.ipynb_checkpoints/README-checkpoint.md CHANGED
@@ -36,3 +36,8 @@ Since it is not based on [transformers](https://github.com/huggingface/transform
  model.net.to_logits[1].weight.requires_grad_(False)
  model.net.to_logits[1].weight.copy_(emb)
  ```
+
+
+ ## Training procedure
+
+ The model has primarily been trained with a language-modeling objective. However, I added a few tricks to further optimize the training. The main trick, using pretrained embeddings, is explained in the Devlog blog post linked above. The batch size is 8 with a sequence length of 128, and the optimizer is AdamW with a learning rate of 2e-5. Gradient accumulation over 4 steps is used, so the effective batch size is 32. Training uses two types of losses: simple cross-entropy for next-token prediction and a distillation loss from GPT2-xl; the two are alternated during training. Gradient norms are also clipped at 1.0.
README.md CHANGED
@@ -36,3 +36,8 @@ Since it is not based on [transformers](https://github.com/huggingface/transform
  model.net.to_logits[1].weight.requires_grad_(False)
  model.net.to_logits[1].weight.copy_(emb)
  ```
+
+
+ ## Training procedure
+
+ The model has primarily been trained with a language-modeling objective. However, I added a few tricks to further optimize the training. The main trick, using pretrained embeddings, is explained in the Devlog blog post linked above. The batch size is 8 with a sequence length of 128, and the optimizer is AdamW with a learning rate of 2e-5. Gradient accumulation over 4 steps is used, so the effective batch size is 32. Training uses two types of losses: simple cross-entropy for next-token prediction and a distillation loss from GPT2-xl; the two are alternated during training. Gradient norms are also clipped at 1.0.
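For clarity, the procedure described in the added section can be sketched as a training loop like the one below. This is a minimal illustration under stated assumptions, not the actual training script from this commit: the student `model` and `train_loader` are assumed to exist (the model is assumed to return logits of shape `(batch, seq, vocab)`), and the exact form of the distillation loss (here a KL divergence against GPT2-xl soft targets) is one plausible choice.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

# Assumed to be defined elsewhere (hypothetical names):
#   model        - the student model from the snippet above
#   train_loader - yields batches of token ids with shape (8, 128)

teacher = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval().cuda()  # distillation teacher
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 4  # batch size 8 x 4 accumulation steps = effective batch size 32

for step, tokens in enumerate(train_loader):
    tokens = tokens.cuda()
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # assumed: returns logits of shape (batch, seq, vocab)

    if step % 2 == 0:
        # next-token cross-entropy loss
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    else:
        # distillation loss against GPT2-xl soft targets (KL divergence as one possible choice)
        with torch.no_grad():
            teacher_logits = teacher(inputs).logits
        loss = F.kl_div(
            F.log_softmax(logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )

    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradient norm at 1.0
        optimizer.step()
        optimizer.zero_grad()
```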