---
base_model: gpt2
datasets:
- wikimedia/wikipedia
library_name: Distily
license: mit
tags:
- bitnet
- 1.58b
- generated_from_trainer
model-index:
- name: distily_projector_experiment
  results: []
---
# Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.
# Model Architecture

- Architecture: GPT2LMHeadModel
- Total Parameters: 124,439,808
- Data Type (dtype): torch.bfloat16
- Model Size: 0.24 GB
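
Since the student is a standard GPT-2 causal LM, it can be loaded and smoke-tested with plain `transformers` calls. A minimal sketch; the repo id below is a placeholder for wherever this checkpoint is hosted, and the GPT-2 tokenizer is assumed since teacher and student share the architecture and vocabulary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "distily_projector_experiment" is a placeholder repo id; substitute the
# actual Hub path of this checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "distily_projector_experiment", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed: student keeps the GPT-2 tokenizer

prompt = tokenizer("The history of Wikipedia begins", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```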
# Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 43.25 | 61.25 | | | | | 11.6875 | 19.125 |
| 0 | 0 | 1460288880640.0 | 113799453474816.0 | 20.3200 | 25.9904 | 96.189 | 12.043 | 7247757312.0 | 18966575579136.0 |
| 2500 | 0.0404 | 896.0 | 5856.0 | 2.1915 | 26.0834 | 95.846 | 12.0 | 604.0 | 11200.0 |
| 5000 | 0.0808 | 330.0 | 1424.0 | 1.5171 | 26.0834 | 95.847 | 12.0 | 262.0 | 296.0 |
| 7500 | 0.1212 | 210.0 | 816.0 | 1.2547 | 26.1072 | 95.759 | 11.989 | 188.0 | 162.0 |
| 10000 | 0.1616 | 162.0 | 584.0 | 1.0555 | 26.1053 | 95.766 | 11.99 | 131.0 | 124.5 |
| 12500 | 0.2020 | 114.5 | 466.0 | 0.8387 | 26.1335 | 95.663 | 11.977 | 88.5 | 139.0 |
| 15000 | 0.2424 | 100.0 | 384.0 | 0.7409 | 26.0745 | 95.879 | 12.004 | 77.0 | 130.0 |
| 17500 | 0.2828 | 87.0 | 336.0 | 0.6531 | 26.124 | 95.698 | 11.981 | 66.5 | 114.5 |
| 20000 | 0.3232 | 75.5 | 286.0 | 0.5932 | 26.1016 | 95.78 | 11.992 | 66.5 | 104.5 |
| 22500 | 0.3636 | 67.5 | 232.0 | 0.5080 | 26.1245 | 95.696 | 11.981 | 53.75 | 106.5 |
| 25000 | 0.4040 | 64.5 | 223.0 | 0.4751 | 26.0572 | 95.943 | 12.012 | 50.5 | 89.5 |
| 27500 | 0.4444 | 64.0 | 207.0 | 0.4674 | 26.1072 | 95.759 | 11.989 | 49.25 | 159.0 |
| 30000 | 0.4848 | 65.0 | 204.0 | 0.4668 | 26.047 | 95.98 | 12.017 | 49.5 | 100.5 |
| 32500 | 0.5253 | 65.5 | 215.0 | 0.4567 | 26.0814 | 95.854 | 12.001 | 48.5 | 85.0 |
| 35000 | 0.5657 | 63.25 | 188.0 | 0.4214 | 26.0269 | 96.054 | 12.026 | 45.25 | 107.0 |
| 37500 | 0.6061 | 62.0 | 181.0 | 0.4092 | 26.0391 | 96.009 | 12.02 | 41.75 | 109.0 |
| 40000 | 0.6465 | 59.25 | 178.0 | 0.3990 | 26.0153 | 96.097 | 12.031 | 40.75 | 93.0 |
| 42500 | 0.6869 | 59.25 | 190.0 | 0.3829 | 26.1161 | 95.726 | 11.985 | 40.75 | 76.0 |
| 45000 | 0.7273 | 53.75 | 148.0 | 0.3251 | 26.0653 | 95.913 | 12.008 | 36.0 | 72.0 |
| 47500 | 0.7677 | 52.25 | 141.0 | 0.3087 | 26.1075 | 95.758 | 11.989 | 33.75 | 53.25 |
| 50000 | 0.8081 | 50.75 | 138.0 | 0.2983 | 26.0725 | 95.886 | 12.005 | 34.0 | 38.5 |
| 52500 | 0.8485 | 51.25 | 134.0 | 0.2931 | 26.0671 | 95.906 | 12.007 | 33.5 | 42.0 |
| 55000 | 0.8889 | 50.0 | 131.0 | 0.2816 | 26.0871 | 95.833 | 11.998 | 32.5 | 36.0 |
| 57500 | 0.9293 | 49.75 | 131.0 | 0.2777 | 26.0528 | 95.959 | 12.014 | 32.0 | 35.5 |
| 60000 | 0.9697 | 49.5 | 131.0 | 0.2760 | 26.0611 | 95.928 | 12.01 | 31.875 | 34.25 |
| 61875 | 1.0 | 49.5 | 131.0 | 0.2757 | 26.0451 | 95.987 | 12.018 | 31.875 | 34.25 |
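
The `*ppl` columns are perplexities on held-out text from each corpus (lower is better). The exact evaluation harness isn't reproduced here, but token-level perplexity for a causal LM is conventionally computed as the exponential of the mean next-token cross-entropy; a sketch, not Distily's evaluation code:

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    # exp of the mean next-token cross-entropy over the sequence
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # labels are shifted internally
    return torch.exp(loss).item()
```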
# Resource Usage Comparison

- VRAM Use: 7.5008 GB
# Distillation (Teacher -> Student) Architecture Difference

- Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
- Total Parameters: 124,439,808 -> 124,439,808
- Data Type (dtype): torch.bfloat16 -> torch.bfloat16
- Model Size: 0.24 GB -> 0.24 GB
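
Teacher and student thus share the same architecture and parameter count; the difference is that the student was trained as a BitNet (`student_model_as_bitnet: True` in the hyperparameters, matching the bitnet/1.58b tags), i.e. with weights constrained toward ternary values. A minimal sketch of the absmean ternary ("1.58-bit") weight quantization described in the BitNet b1.58 paper, using a straight-through estimator for training; Distily's actual implementation may differ:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1} * gamma with gamma = mean(|w|) (BitNet b1.58)."""
    gamma = w.abs().mean().clamp(min=eps)           # per-tensor scale
    w_q = (w / gamma).round().clamp(-1, 1) * gamma  # ternary weights, rescaled
    # straight-through estimator: quantized forward pass, full-precision gradients
    return w + (w_q - w).detach()
```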
# Train Dataset

Trained on 145,756,992 tokens from the wikimedia/wikipedia dataset.

- Num Samples: 247,500
- Subset: 20231101.en
- Split: train
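
The sample count follows from the `dataset_*` hyperparameters below: 250,000 rows sampled, 1% held out for evaluation, leaving 247,500 training rows. A sketch of an equivalent split with the `datasets` library; Distily's exact sampling code may differ:

```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
ds = ds.shuffle(seed=42).select(range(250_000))        # dataset_sample_size: 250000
splits = ds.train_test_split(test_size=0.01, seed=42)  # dataset_test_size: 0.01
train_ds, eval_ds = splits["train"], splits["test"]    # 247,500 / 2,500 rows
```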
# Training Objective

```
DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))
```
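
The objective distills only the output logits, with a KL-divergence loss at weight 1. A minimal sketch of such a loss in PyTorch, assuming logits of shape `(batch, seq, vocab)`; this illustrates the technique, not Distily's internal code:

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # flatten (batch, seq, vocab) -> (batch*seq, vocab) so batchmean averages per token
    s = F.log_softmax(student_logits, dim=-1).flatten(0, -2)
    t = F.log_softmax(teacher_logits, dim=-1).flatten(0, -2)
    # KL(teacher || student), the standard distillation direction
    return F.kl_div(s, t, log_target=True, reduction="batchmean")
```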
# Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.5
- num_epochs: 1.0
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))
- train_embeddings: True
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f48441cc250>`
- student_model_name_or_path: None
- student_config_name_or_path: None
- student_model_config: None
- reinitialize_weights: None
- copy_teacher_modules: [('lm_head', False)]
- student_model_as_bitnet: True
- student_model_compile: False
- dropout: None
- teacher_model_name_or_path: gpt2
- teacher_load_in_8bit: False
- teacher_load_in_4bit: False
- teacher_model_compile: False
- dataset_uri: wikimedia/wikipedia
- dataset_subset: 20231101.en
- dataset_split: train
- dataset_column_name: text
- dataset_sample_size: 250000
- dataset_test_size: 0.01
- gradient_accumulation_steps: 1
- weight_decay: 0.0
- max_grad_norm: 1.0
- warmup_ratio: 0.5
- warmup_steps: 0
- gradient_checkpointing: True
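
For reference, the optimizer and schedule above (linear decay after a 50% warmup ratio, over the one-epoch run of 61,875 steps shown in the metrics table) map onto standard PyTorch/Transformers utilities. A sketch, assuming `model` is the student:

```python
import torch
from transformers import get_linear_schedule_with_warmup

total_steps = 61_875                   # one epoch at these settings (see table above)
warmup_steps = int(0.5 * total_steps)  # lr_scheduler_warmup_ratio: 0.5

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
```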
# Framework Versions

- Distily 0.2.0
- Transformers 4.44.1
- Pytorch 2.3.0
- Datasets 2.21.0