A GPT trained with nanoGPT using a custom attention mechanism.
Reflex attention: to avoid forgetting earlier information, each layer attends not only to its own hidden state but also to the output of every previous layer. Notation: SA_i(x) is self-attention at layer i with x heads; CA_i is a 1-head cross-attention onto the output of layer i. The per-layer head layout is listed below (a code sketch follows the list):
- SA_0 (8)
- Concat[SA_1 (7), CA_0]
- Concat[SA_2 (6), CA_0, CA_1]
- Concat[SA_3 (5), CA_0, CA_1, CA_2]
- Concat[SA_4 (4), CA_0, CA_1, CA_2, CA_3]
- Concat[SA_5 (3), CA_0, CA_1, CA_2, CA_3, CA_4]
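Below is a minimal PyTorch sketch of this head layout. The module and argument names (MultiHeadAttention, ReflexBlock, prev_outputs) are assumptions for illustration, not the repository's actual implementation; residual connections, LayerNorm, and the MLP are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Causal attention; passing kv_x turns it into cross-attention."""
    def __init__(self, n_embd, n_head, head_dim):
        super().__init__()
        self.n_head, self.head_dim = n_head, head_dim
        self.q = nn.Linear(n_embd, n_head * head_dim, bias=False)
        self.kv = nn.Linear(n_embd, 2 * n_head * head_dim, bias=False)

    def forward(self, x, kv_x=None):
        B, T, _ = x.shape
        kv_x = x if kv_x is None else kv_x           # self- vs cross-attention
        q = self.q(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k, v = self.kv(kv_x).chunk(2, dim=-1)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return y.transpose(1, 2).reshape(B, T, self.n_head * self.head_dim)

class ReflexBlock(nn.Module):
    """Layer i: (n_head - i) self-attention heads on the current hidden state,
    plus one 1-head cross-attention per earlier layer output, concatenated
    back to n_embd and projected."""
    def __init__(self, layer_idx, n_embd=768, n_head=8):
        super().__init__()
        head_dim = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head - layer_idx, head_dim)
        self.ca = nn.ModuleList(
            MultiHeadAttention(n_embd, 1, head_dim) for _ in range(layer_idx)
        )
        self.proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x, prev_outputs):
        parts = [self.sa(x)]
        parts += [ca(x, kv_x=h) for ca, h in zip(self.ca, prev_outputs)]
        return self.proj(torch.cat(parts, dim=-1))   # (B, T, n_embd)

# Usage: run the 6 blocks, caching each output for all later blocks.
blocks = nn.ModuleList(ReflexBlock(i) for i in range(6))
x = torch.randn(2, 1024, 768)                        # (batch, block_size, n_embd)
outputs = []
for block in blocks:
    x = block(x, outputs)
    outputs.append(x)
```

With this layout every layer keeps a total of 8 heads (e.g. layer 5 uses 3 self-attention heads plus 5 single-head cross-attentions), so the concatenated width always equals n_embd.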
Configs:
- batch_size = 32
- bias = False
- block_size = 1024
- n_head = 8
- n_layer = 6
- dropout = 0.0
- n_embd = 768
- vocab_size = 50304
- gradient_accumulation_steps = 1
- learning_rate = 1e-3
- iters = 7250
- lr_decay_iters = 5000
- min_lr = 1e-5
- warmup_iters = 400
Results:
- mfu = 30.45935
- train_loss = 3.89231
- val_loss = 3.90462
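For reference, a hypothetical nanoGPT-style config file with these hyperparameters could look like the sketch below; the file name and the use of max_iters for the iteration count are assumptions, while the values mirror the list above.

```python
# config/train_reflex.py (hypothetical file name)

# model
n_layer = 6
n_head = 8
n_embd = 768
dropout = 0.0
bias = False
block_size = 1024
# vocab_size = 50304 is nanoGPT's padded GPT-2 default, set inside train.py

# batching
batch_size = 32
gradient_accumulation_steps = 1

# optimizer / LR schedule
learning_rate = 1e-3
max_iters = 7250          # listed as "iters" above
lr_decay_iters = 5000
warmup_iters = 400
min_lr = 1e-5
```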