---
base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_projector_experiment
    results: []
---

Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.
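
A minimal usage sketch; the hub id below is assumed from the model-index name above and may differ from the actual repository path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_projector_experiment"  # assumed hub id; adjust if the repo lives elsewhere
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Knowledge distillation compresses a model by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```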

Model Architecture (see the quick check after this list):

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
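
Because the student keeps the teacher's GPT2LMHeadModel architecture, the parameter count above can be sanity-checked by counting gpt2's own parameters:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())  # tied lm_head counted once
print(f"{n_params:,}")                 # 124,439,808
print(f"{n_params * 2 / 1e9:.2f} GB")  # ~0.25 GB at 2 bytes per bfloat16 parameter
```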

Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.25 | 61.25 | | | | | 11.6875 | 19.125 |
| 0 | 0 | 1460288880640.0 | 113799453474816.0 | 20.3200 | 25.9904 | 96.189 | 12.043 | 7247757312.0 | 18966575579136.0 |
| 2500 | 0.0404 | 896.0 | 5856.0 | 2.1915 | 26.0834 | 95.846 | 12.0 | 604.0 | 11200.0 |
| 5000 | 0.0808 | 330.0 | 1424.0 | 1.5171 | 26.0834 | 95.847 | 12.0 | 262.0 | 296.0 |
| 7500 | 0.1212 | 210.0 | 816.0 | 1.2547 | 26.1072 | 95.759 | 11.989 | 188.0 | 162.0 |
| 10000 | 0.1616 | 162.0 | 584.0 | 1.0555 | 26.1053 | 95.766 | 11.99 | 131.0 | 124.5 |
| 12500 | 0.2020 | 114.5 | 466.0 | 0.8387 | 26.1335 | 95.663 | 11.977 | 88.5 | 139.0 |
| 15000 | 0.2424 | 100.0 | 384.0 | 0.7409 | 26.0745 | 95.879 | 12.004 | 77.0 | 130.0 |
| 17500 | 0.2828 | 87.0 | 336.0 | 0.6531 | 26.124 | 95.698 | 11.981 | 66.5 | 114.5 |
| 20000 | 0.3232 | 75.5 | 286.0 | 0.5932 | 26.1016 | 95.78 | 11.992 | 66.5 | 104.5 |
| 22500 | 0.3636 | 67.5 | 232.0 | 0.5080 | 26.1245 | 95.696 | 11.981 | 53.75 | 106.5 |
| 25000 | 0.4040 | 64.5 | 223.0 | 0.4751 | 26.0572 | 95.943 | 12.012 | 50.5 | 89.5 |
| 27500 | 0.4444 | 64.0 | 207.0 | 0.4674 | 26.1072 | 95.759 | 11.989 | 49.25 | 159.0 |
| 30000 | 0.4848 | 65.0 | 204.0 | 0.4668 | 26.047 | 95.98 | 12.017 | 49.5 | 100.5 |
| 32500 | 0.5253 | 65.5 | 215.0 | 0.4567 | 26.0814 | 95.854 | 12.001 | 48.5 | 85.0 |
| 35000 | 0.5657 | 63.25 | 188.0 | 0.4214 | 26.0269 | 96.054 | 12.026 | 45.25 | 107.0 |
| 37500 | 0.6061 | 62.0 | 181.0 | 0.4092 | 26.0391 | 96.009 | 12.02 | 41.75 | 109.0 |
| 40000 | 0.6465 | 59.25 | 178.0 | 0.3990 | 26.0153 | 96.097 | 12.031 | 40.75 | 93.0 |
| 42500 | 0.6869 | 59.25 | 190.0 | 0.3829 | 26.1161 | 95.726 | 11.985 | 40.75 | 76.0 |
| 45000 | 0.7273 | 53.75 | 148.0 | 0.3251 | 26.0653 | 95.913 | 12.008 | 36.0 | 72.0 |
| 47500 | 0.7677 | 52.25 | 141.0 | 0.3087 | 26.1075 | 95.758 | 11.989 | 33.75 | 53.25 |
| 50000 | 0.8081 | 50.75 | 138.0 | 0.2983 | 26.0725 | 95.886 | 12.005 | 34.0 | 38.5 |
| 52500 | 0.8485 | 51.25 | 134.0 | 0.2931 | 26.0671 | 95.906 | 12.007 | 33.5 | 42.0 |
| 55000 | 0.8889 | 50.0 | 131.0 | 0.2816 | 26.0871 | 95.833 | 11.998 | 32.5 | 36.0 |
| 57500 | 0.9293 | 49.75 | 131.0 | 0.2777 | 26.0528 | 95.959 | 12.014 | 32.0 | 35.5 |
| 60000 | 0.9697 | 49.5 | 131.0 | 0.2760 | 26.0611 | 95.928 | 12.01 | 31.875 | 34.25 |
| 61875 | 1.0 | 49.5 | 131.0 | 0.2757 | 26.0451 | 95.987 | 12.018 | 31.875 | 34.25 |
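
The *ppl columns are perplexities (lower is better), presumably measured on English, French, and Chinese Wikipedia and on TinyStories text. As an illustration of how such a figure is obtained (not Distily's exact evaluation code), a single-text perplexity computation looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # teacher; point at the student checkpoint to reproduce its rows
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Wikipedia is a free online encyclopedia written and maintained by volunteers."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy per token
print(f"perplexity = {torch.exp(loss).item():.2f}")
```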

Resource Usage Comparison

  • VRAM Use: 7.5008 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,756,992 tokens from the wikimedia/wikipedia dataset; a loading sketch follows the list below.

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
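
The same configuration can be loaded with the datasets library; streaming avoids downloading the full English dump at once.

```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
for example in ds.take(2):
    print(example["text"][:200])
```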

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))
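
Distily's implementation is not reproduced here, but a minimal sketch of what a KL logits-distillation loss computes, matching the student's next-token distribution to the teacher's at every position, looks like this:

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, summed over positions, averaged over the batch."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# toy shapes: (batch, sequence, vocab)
student = torch.randn(2, 8, 50257, requires_grad=True)
teacher = torch.randn(2, 8, 50257)
loss = kl_logits_loss(student, teacher)
loss.backward()
```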

Hyperparameters

The following hyperparameters were used during training; an equivalent optimizer and schedule setup is sketched after the list:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))
  • train_embeddings: True
  • lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7f48441cc250>
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • student_model_compile: False
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • teacher_model_compile: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
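
A rough reconstruction of the optimizer and learning-rate schedule these settings imply, using standard PyTorch and Transformers utilities rather than the actual Distily training loop:

```python
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")
num_training_steps = 61_875  # final step in the metrics table above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.5 * num_training_steps),  # lr_scheduler_warmup_ratio: 0.5
    num_training_steps=num_training_steps,
)
```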

Framework Versions

  • Distily 0.2.0
  • Transformers 4.44.1
  • Pytorch 2.3.0
  • Datasets 2.21.0