---
base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_projector_experiment
    results: []
---

Summary

Distilled with the Distily library using the teacher model gpt2 on the wikimedia/wikipedia dataset. A minimal loading sketch follows the architecture summary below.

Model Architecture:

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
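
A minimal sketch of loading the distilled student for inference. The Hub repo id (lapp0/distily_projector_experiment) and the reuse of the gpt2 tokenizer are assumptions, not stated in this card:

```python
# Minimal loading sketch; repo id and tokenizer choice are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_projector_experiment"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed: student keeps the gpt2 tokenizer
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Knowledge distillation compresses a model by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```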

Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.25 | 61.25 | | | | | 11.6875 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 22.0554 | 29.7695 | 83.978 | 10.514 | 4060086272.0 | 71468255805440.0 |
| 2500 | 0.0404 | 744.0 | 5792.0 | 2.6072 | 29.8269 | 83.817 | 10.494 | 450.0 | 4192.0 |
| 5000 | 0.0808 | 316.0 | 1408.0 | 1.9087 | 29.8408 | 83.778 | 10.489 | 237.0 | 300.0 |
| 7500 | 0.1212 | 221.0 | 776.0 | 1.6174 | 29.8221 | 83.83 | 10.496 | 180.0 | 175.0 |
| 10000 | 0.1616 | 165.0 | 616.0 | 1.4278 | 29.8115 | 83.86 | 10.499 | 144.0 | 150.0 |
| 12500 | 0.2020 | 124.5 | 484.0 | 1.1898 | 29.8065 | 83.874 | 10.501 | 107.0 | 146.0 |
| 15000 | 0.2424 | 108.0 | 452.0 | 1.0632 | 29.7962 | 83.903 | 10.505 | 92.5 | 109.0 |
| 17500 | 0.2828 | 91.5 | 352.0 | 0.9642 | 29.7907 | 83.919 | 10.507 | 75.5 | 119.0 |
| 20000 | 0.3232 | 79.0 | 302.0 | 0.8892 | 29.8022 | 83.886 | 10.503 | 68.0 | 688.0 |
| 22500 | 0.3636 | 72.5 | 236.0 | 0.7640 | 29.8032 | 83.883 | 10.502 | 59.0 | 90.5 |
| 25000 | 0.4040 | 68.0 | 202.0 | 0.7181 | 29.8737 | 83.686 | 10.477 | 51.75 | 85.5 |
| 27500 | 0.4444 | 63.75 | 223.0 | 0.6853 | 29.8142 | 83.853 | 10.498 | 48.0 | 99.0 |
| 30000 | 0.4848 | 63.5 | 214.0 | 0.6791 | 29.8122 | 83.858 | 10.499 | 50.75 | 76.5 |
| 32500 | 0.5253 | 64.0 | 194.0 | 0.6660 | 29.794 | 83.91 | 10.505 | 46.75 | 96.5 |
| 35000 | 0.5657 | 60.25 | 176.0 | 0.6075 | 29.8902 | 83.639 | 10.472 | 41.5 | 62.25 |
| 37500 | 0.6061 | 60.5 | 169.0 | 0.5942 | 29.8616 | 83.72 | 10.482 | 43.5 | 78.5 |
| 40000 | 0.6465 | 57.5 | 176.0 | 0.5808 | 29.8082 | 83.87 | 10.5 | 39.5 | 80.5 |
| 42500 | 0.6869 | 58.0 | 172.0 | 0.5602 | 29.7977 | 83.899 | 10.504 | 40.0 | 58.75 |
| 45000 | 0.7273 | 52.5 | 145.0 | 0.4723 | 29.7887 | 83.924 | 10.507 | 34.25 | 47.0 |
| 47500 | 0.7677 | 52.75 | 135.0 | 0.4507 | 29.7668 | 83.986 | 10.515 | 33.5 | 41.25 |
| 50000 | 0.8081 | 51.25 | 133.0 | 0.4370 | 29.7994 | 83.894 | 10.504 | 31.875 | 39.75 |
| 52500 | 0.8485 | 49.75 | 127.5 | 0.4272 | 29.7762 | 83.96 | 10.512 | 32.25 | 38.0 |
| 55000 | 0.8889 | 49.25 | 126.5 | 0.4130 | 29.814 | 83.853 | 10.498 | 31.25 | 35.5 |
| 57500 | 0.9293 | 48.5 | 125.0 | 0.4079 | 29.7893 | 83.923 | 10.507 | 30.625 | 34.25 |
| 60000 | 0.9697 | 48.75 | 123.5 | 0.4046 | 29.8263 | 83.819 | 10.494 | 30.625 | 34.75 |
| 61875 | 1.0 | 48.75 | 124.0 | 0.4043 | 29.85 | 83.752 | 10.486 | 30.625 | 34.5 |
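
The *ppl columns report perplexity on held-out text from the named datasets. Below is a hedged sketch of how such a number can be computed with Transformers; the exact Distily evaluation pipeline (sample selection, sequence length, batching) may differ:

```python
# Hedged perplexity sketch: exp of the mean per-token cross-entropy loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # For causal LMs, passing labels returns the mean next-token NLL as .loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(perplexity(model, tokenizer, "Paris is the capital of France."))
```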

Resource Usage Comparison

  • VRAM Use: 7.7843 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB
Module Diff Details


Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset (a data-loading sketch follows the details below).

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
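
A hedged sketch of pulling the same subset and split with the Hugging Face datasets library; the shuffling and sample selection below are assumptions based on the sample-size and seed values in this card:

```python
# Hedged data-loading sketch; sampling details are assumptions.
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
# dataset_sample_size is 250,000 with a 1% test split, leaving ~247,500 training samples.
ds = ds.shuffle(seed=42).select(range(250_000))
print(ds[0]["text"][:200])
```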

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=layer-2))
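
In plain terms, the objective adds a KL divergence between teacher and student logits (weight 1) to a raw mean-squared error between their attention maps (weight 10.0). Below is a rough PyTorch sketch under those assumptions; the function name, the simple index-wise layer pairing (the card's layer-2 mapper is not reproduced), and the reductions are illustrative, not Distily's actual implementation:

```python
# Rough sketch of the combined distillation loss described above.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attns, teacher_attns,
                      logits_weight=1.0, attn_weight=10.0):
    # KL divergence between the teacher and student next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Raw MSE over paired attention maps; here simply matched index-wise.
    attn_mse = sum(F.mse_loss(s, t) for s, t in zip(student_attns, teacher_attns))
    attn_mse = attn_mse / max(len(student_attns), 1)
    return logits_weight * kl + attn_weight * attn_mse
```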

Hyperparameters

The following hyperparameters were used during training (an optimizer/scheduler sketch follows the list):

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=layer-2))
  • train_embeddings: True
  • lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7f010c128160>
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • student_model_compile: False
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • teacher_model_compile: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
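
A minimal PyTorch sketch of the optimizer and linear schedule implied by the values above (lr 1e-4, Adam betas (0.9, 0.999), eps 1e-8, warmup_ratio 0.5); how Distily actually wires these objects together is not shown in this card:

```python
# Hedged sketch of the optimizer/scheduler settings listed above.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, num_training_steps: int):
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=1e-4,              # learning_rate
        betas=(0.9, 0.999),   # optimizer betas
        eps=1e-8,             # epsilon
        weight_decay=0.0,     # weight_decay
    )
    # warmup_ratio 0.5: LR rises linearly over the first half of training,
    # then decays linearly to zero over the second half.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.5 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```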

Framework Versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.21.0