GPT trained with nanoGPT using a custom attention mechanism.

Reflex attention: so that the network does not forget information from earlier layers, each layer attends not only to its current hidden state but also to the output of every previous layer. Notation: SA_i (x) is self-attention at layer i using x heads; CA_i is cross-attention to the output of layer i using 1 head. The per-layer layout is (a code sketch follows the list):

  1. SA_0 (8)
  2. Concat[SA_1 (7), CA_0]
  3. Concat[SA_2 (6), CA_0, CA_1]
  4. Concat[SA_3 (5), CA_0, CA_1, CA_2]
  5. Concat[SA_4 (4), CA_0, CA_1, CA_2, CA_3]
  6. Concat[SA_5 (3), CA_0, CA_1, CA_2, CA_3, CA_4]
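A minimal PyTorch sketch of this block layout is given below. It is an illustration, not the training code: it assumes head_dim = n_embd / n_head = 768 / 8 = 96 (so every layer's concatenated output returns to 768 dimensions), interprets CA_j as queries from the current hidden state attending to the cached output of layer j, and omits causal masking, LayerNorm, and the MLP sub-blocks. All class and variable names are illustrative.

```python
import torch
import torch.nn as nn

N_EMBD, N_HEAD, N_LAYER = 768, 8, 6
HEAD_DIM = N_EMBD // N_HEAD  # 96


class ReflexBlock(nn.Module):
    """Layer i: self-attention with (N_HEAD - i) heads on the current stream,
    plus one single-head cross-attention per previous layer output, all concatenated."""

    def __init__(self, layer_idx: int):
        super().__init__()
        sa_heads = N_HEAD - layer_idx  # 8, 7, 6, 5, 4, 3 heads for layers 0..5
        self.sa_proj = nn.Linear(N_EMBD, sa_heads * HEAD_DIM, bias=False)
        self.sa = nn.MultiheadAttention(sa_heads * HEAD_DIM, sa_heads, batch_first=True)
        # one single-head cross-attention per earlier layer:
        # queries come from the current stream, keys/values from that layer's cached output
        self.ca_q = nn.ModuleList([nn.Linear(N_EMBD, HEAD_DIM, bias=False) for _ in range(layer_idx)])
        self.ca_kv = nn.ModuleList([nn.Linear(N_EMBD, HEAD_DIM, bias=False) for _ in range(layer_idx)])
        self.ca = nn.ModuleList([nn.MultiheadAttention(HEAD_DIM, 1, batch_first=True)
                                 for _ in range(layer_idx)])

    def forward(self, x, prev_outputs):
        h = self.sa_proj(x)
        parts = [self.sa(h, h, h, need_weights=False)[0]]        # SA_i with (8 - i) heads
        for j, prev in enumerate(prev_outputs):                  # CA_0 ... CA_{i-1}
            q, kv = self.ca_q[j](x), self.ca_kv[j](prev)
            parts.append(self.ca[j](q, kv, kv, need_weights=False)[0])
        return torch.cat(parts, dim=-1)                          # (8 - i + i) * 96 = 768


class ReflexStack(nn.Module):
    """The 6-layer stack; each layer's output is cached so later layers can cross-attend to it."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([ReflexBlock(i) for i in range(N_LAYER)])

    def forward(self, x):
        outputs = []
        for block in self.blocks:
            x = block(x, outputs)
            outputs.append(x)
        return x


# shape check: (batch, seq, n_embd) in -> same shape out
y = ReflexStack()(torch.randn(2, 16, N_EMBD))
assert y.shape == (2, 16, N_EMBD)
```

The self-attention head count shrinks by exactly one per layer to make room for the extra single-head cross-attention output in the concatenation, which keeps the layer output width fixed at n_embd = 768.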

Configs:

  • batch_size = 32
  • bias = False
  • block_size = 1024
  • n_head = 8
  • n_layer = 6
  • dropout = 0.0
  • n_embd = 768
  • vocab_size = 50304
  • gradient_accumulation_steps = 1
  • learning_rate = 1e-3
  • max_iters = 7250
  • lr_decay_iters = 5000
  • min_lr = 1e-5
  • warmup_iters = 400
  • mfu = 30.45935
  • train_loss = 3.89231
  • val_loss = 3.90462
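
For convenience, here are the same training settings written as a hypothetical nanoGPT-style config file (variable names follow nanoGPT's train.py conventions; the file name is illustrative, and in stock nanoGPT the vocabulary size is set on the model config rather than in this file):

```python
# config/train_reflex_attention.py -- hypothetical nanoGPT-style config with the values above
batch_size = 32
block_size = 1024
n_layer = 6
n_head = 8
n_embd = 768
bias = False
dropout = 0.0
gradient_accumulation_steps = 1
learning_rate = 1e-3
max_iters = 7250
lr_decay_iters = 5000
min_lr = 1e-5
warmup_iters = 400
# vocab_size = 50304 (the padded GPT-2 vocabulary) is applied on the model side.
```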
