Summary

Distilled with the Distily library from the teacher model gpt2 on the wikimedia/wikipedia dataset.

Model Architecture:

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
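
Both the teacher and the student are plain GPT2LMHeadModel checkpoints, so the model should load through the standard transformers API. A minimal sketch follows; the repo id and bfloat16 dtype are taken from this card, and it assumes the repo ships a GPT-2 tokenizer:

```python
# Minimal loading sketch (standard transformers usage; nothing Distily-specific).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "distily/distily_test_attn_miles"
tokenizer = AutoTokenizer.from_pretrained(repo_id)  # assumes a GPT-2 tokenizer is bundled
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Knowledge distillation compresses a model by", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```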

Benchmark Metrics Comparison

| Metric | attn_layer_mapper=all, attn_loss_fn=logsum, attn_projector=miles | attn_layer_mapper=all, attn_loss_fn=raw_mse, attn_projector=miles | teacher |
| --- | --- | --- | --- |
| ai2_arc (acc) | 0.228 | 0.256 | 0.304 |
| ai2_arc (acc_norm) | 0.258 | 0.267 | 0.309 |
| arc_challenge (acc) | 0.186 | 0.177 | 0.184 |
| arc_challenge (acc_norm) | 0.227 | 0.202 | 0.214 |
| arc_easy (acc) | 0.27 | 0.335 | 0.424 |
| arc_easy (acc_norm) | 0.288 | 0.332 | 0.405 |
| boolq (acc) | 0.375 | 0.377 | 0.541 |
| cola (mcc) | 0.0 | 0.0 | 0.009 |
| glue (acc) | 0.454 | 0.444 | 0.41 |
| glue (f1) | 0.0 | 0.279 | 0.526 |
| glue (mcc) | 0.0 | 0.0 | 0.009 |
| hellaswag (acc) | 0.282 | 0.302 | 0.337 |
| hellaswag (acc_norm) | 0.275 | 0.308 | 0.384 |
| mnli (acc) | 0.326 | 0.331 | 0.323 |
| mnli_mismatch (acc) | 0.295 | 0.367 | 0.344 |
| mrpc (acc) | 0.316 | 0.336 | 0.515 |
| mrpc (f1) | 0.0 | 0.075 | 0.631 |
| qnli (acc) | 0.527 | 0.519 | 0.472 |
| qqp (acc) | 0.673 | 0.515 | 0.34 |
| qqp (f1) | 0.0 | 0.363 | 0.483 |
| rte (acc) | 0.52 | 0.57 | 0.516 |
| sst2 (acc) | 0.492 | 0.498 | 0.511 |
| wikitext (bits_per_byte) | 1.888 | 1.273 | 0.98 |
| wikitext (byte_perplexity) | 3.701 | 2.416 | 1.973 |
| wikitext (word_perplexity) | 1094.0 | 111.9 | 37.82 |
| wnli (acc) | 0.437 | 0.521 | 0.451 |
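
For the wikitext rows, bits-per-byte is the base-2 logarithm of byte-perplexity, so those two rows report the same measurement on different scales; for example, for the first student column:

$$\text{bits\_per\_byte} = \log_2(\text{byte\_perplexity}), \qquad \log_2(3.701) \approx 1.888$$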

Resource Usage Comparison

  • VRAM Use: 7.7871 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
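
For reference, the same subset and split can be pulled with the datasets library. A minimal sketch (streaming to avoid downloading the full dump; `text` is the training column listed in the hyperparameters below):

```python
# Sketch: stream the training data referenced above with the `datasets` library.
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
first_article = next(iter(ds))
print(first_article["text"][:200])  # "text" is the dataset_column_name used for training
```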

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=cos, layer_mapper=layer-2, projector=miles))
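
The objective combines a KL-divergence term on the logits (weight 1) with a cosine term on the attention maps (weight 25.0). The sketch below illustrates how such a combined loss can be computed in PyTorch; it is not Distily's implementation, omits the layer mapper and the `miles` projector, and assumes both models were run with `output_attentions=True` and produce attention maps of matching shape:

```python
# Illustrative combined logits-KL + attention-cosine distillation loss.
# Not Distily's implementation: layer mapping and the `miles` projector are omitted.
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, attn_weight=25.0):
    # Logits component (weight 1, loss_fn=kl): KL between token distributions.
    student_logp = F.log_softmax(student_out.logits, dim=-1)
    teacher_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(student_logp, teacher_prob, reduction="batchmean")

    # Attention component (weight 25.0, loss_fn=cos): cosine distance between
    # corresponding attention maps, averaged over layers.
    attn_loss = 0.0
    for s_attn, t_attn in zip(student_out.attentions, teacher_out.attentions):
        cos = F.cosine_similarity(
            s_attn.flatten(start_dim=1), t_attn.flatten(start_dim=1), dim=-1
        )
        attn_loss = attn_loss + (1.0 - cos).mean()
    attn_loss = attn_loss / len(student_out.attentions)

    return logits_loss + attn_weight * attn_loss
```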

Hyperparameters

The following hyperparameters were used during training (a rough `TrainingArguments` sketch follows the list):

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine_with_min_lr
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=cos, layer_mapper=layer-2, projector=miles))
  • train_embeddings: True
  • lr_scheduler: torch.optim.lr_scheduler.LambdaLR instance
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
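
The list above follows the standard transformers Trainer card format. As a rough sketch, the generic fields map onto `transformers.TrainingArguments` as shown below; this is illustrative only, the Distily-specific fields (distillation_objective, teacher_model_name_or_path, and so on) are omitted, and the output directory is a placeholder:

```python
# Rough mapping of the generic hyperparameters above onto transformers.TrainingArguments.
# Illustrative only; Distily-specific fields are omitted and `output_dir` is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distily_test_attn_miles",    # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine_with_min_lr",  # the scheduler's minimum LR is not listed on this card
    warmup_ratio=0.5,
    num_train_epochs=1.0,
    gradient_accumulation_steps=1,
    weight_decay=0.0,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,                               # dtype listed as torch.bfloat16; needs bf16-capable hardware
)
```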

Framework Versions

  • Distily 0.3.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0