---
base_model: gpt2
datasets:
- wikimedia/wikipedia
library_name: Distily
license: mit
tags:
- bitnet
- 1.58b
- generated_from_trainer
model-index:
- name: distily_multi_experiment
  results: []
---


# Summary

Distilled with the [Distily](https://github.com/lapp0/distily) library
from the teacher model [gpt2](https://huggingface.co/gpt2)
on the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.


# Model Architecture
- **Architecture**: `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808
- **Data Type (dtype)**: torch.bfloat16
- **Model Size**: 0.24 GB


# Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **teacher eval** |  | 43.25 | 61.25 |  |  |  |  | 11.6875 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 22.7948 | 25.4866 | 98.091 | 12.281 | 4060086272.0 | 71468255805440.0 |
| 2500 | 0.0404 | 800.0 | 6240.0 | 2.9661 | 25.4278 | 98.318 | 12.309 | 470.0 | 5024.0 |
| 5000 | 0.0808 | 326.0 | 1480.0 | 2.1697 | 25.4996 | 98.041 | 12.275 | 247.0 | 278.0 |
| 7500 | 0.1212 | 224.0 | 804.0 | 1.8396 | 25.5448 | 97.867 | 12.253 | 185.0 | 190.0 |
| 10000 | 0.1616 | 171.0 | 608.0 | 1.6412 | 25.4672 | 98.165 | 12.29 | 145.0 | 166.0 |
| 12500 | 0.2020 | 127.0 | 482.0 | 1.3752 | 25.4897 | 98.079 | 12.279 | 111.0 | 141.0 |
| 15000 | 0.2424 | 104.5 | 436.0 | 1.2398 | 25.4711 | 98.15 | 12.288 | 93.5 | 99.5 |
| 17500 | 0.2828 | 90.5 | 346.0 | 1.1286 | 25.4723 | 98.146 | 12.288 | 74.0 | 147.0 |
| 20000 | 0.3232 | 81.5 | 312.0 | 1.0325 | 25.4627 | 98.183 | 12.293 | 69.5 | 111.0 |
| 22500 | 0.3636 | 73.0 | 236.0 | 0.9000 | 25.4791 | 98.12 | 12.285 | 59.75 | 100.0 |
| 25000 | 0.4040 | 67.0 | 209.0 | 0.8527 | 25.4728 | 98.144 | 12.288 | 53.0 | 183.0 |
| 27500 | 0.4444 | 64.0 | 228.0 | 0.8201 | 25.4859 | 98.094 | 12.281 | 48.0 | 105.5 |
| 30000 | 0.4848 | 64.5 | 225.0 | 0.8103 | 25.489 | 98.082 | 12.28 | 51.75 | 77.5 |
| 32500 | 0.5253 | 64.0 | 194.0 | 0.8016 | 25.4563 | 98.208 | 12.296 | 46.5 | 117.5 |
| 35000 | 0.5657 | 63.5 | 188.0 | 0.7395 | 25.4507 | 98.229 | 12.298 | 44.0 | 73.0 |
| 37500 | 0.6061 | 60.25 | 172.0 | 0.7164 | 25.411 | 98.382 | 12.317 | 45.5 | 68.5 |
| 40000 | 0.6465 | 59.5 | 180.0 | 0.7014 | 25.4454 | 98.25 | 12.301 | 41.25 | 94.5 |
| 42500 | 0.6869 | 58.25 | 168.0 | 0.6708 | 25.4719 | 98.147 | 12.288 | 42.0 | 65.5 |
| 45000 | 0.7273 | 53.75 | 158.0 | 0.5781 | 25.3987 | 98.43 | 12.323 | 35.25 | 67.5 |
| 47500 | 0.7677 | 54.0 | 136.0 | 0.5538 | 25.4465 | 98.245 | 12.3 | 34.0 | 41.75 |
| 50000 | 0.8081 | 52.25 | 136.0 | 0.5368 | 25.4472 | 98.243 | 12.3 | 33.0 | 41.0 |
| 52500 | 0.8485 | 50.75 | 131.0 | 0.5244 | 25.4589 | 98.198 | 12.294 | 33.25 | 38.25 |
| 55000 | 0.8889 | 50.0 | 128.0 | 0.5073 | 25.4565 | 98.207 | 12.295 | 32.0 | 35.5 |
| 57500 | 0.9293 | 49.75 | 127.0 | 0.5019 | 25.4729 | 98.143 | 12.288 | 31.75 | 33.5 |
| 60000 | 0.9697 | 49.75 | 126.5 | 0.4983 | 25.4379 | 98.279 | 12.304 | 31.5 | 33.75 |
| 61875 | 1.0 | 49.75 | 126.5 | 0.4979 | 25.4846 | 98.098 | 12.282 | 31.5 | 33.75 |
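
To make the gap between student and teacher easier to read, the final row can be compared against the teacher row as a ratio per metric. The snippet below is a minimal sketch that copies the values from the table above; the helper name `perplexity_ratios` is hypothetical and is not part of Distily.

```python
# Perplexities copied from the table above.
# "teacher" is the teacher eval row; "student" is the final row (step 61875).
teacher = {
    "enwikippl": 43.25,
    "frwikippl": 61.25,
    "tinystoriesppl": 11.6875,
    "zhwikippl": 19.125,
}
student = {
    "enwikippl": 49.75,
    "frwikippl": 126.5,
    "tinystoriesppl": 31.5,
    "zhwikippl": 33.75,
}


def perplexity_ratios(student, teacher):
    """Return student/teacher perplexity per metric (1.0 means parity)."""
    return {name: student[name] / teacher[name] for name in teacher}


ratios = perplexity_ratios(student, teacher)
# Print metrics from smallest to largest gap.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x teacher")
```

On these numbers the student is closest to the teacher on English Wikipedia (about 1.15x) and furthest on TinyStories (about 2.70x).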

# Resource Usage Comparison

- VRAM Use: 7.7851 GB

# Distillation (Teacher -> Student) Architecture Difference

- **Architecture**: `GPT2LMHeadModel` -> `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808 -> 124,439,808
- **Data Type (dtype)**: torch.bfloat16 -> torch.bfloat16
- **Model Size**: 0.24 GB -> 0.24 GB

<details>
<summary>Module Diff Details</summary>

```diff

```

</details>
<br/>

# Train Dataset
Trained on 145,744,973 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

- Num Samples: `247,500`
- Subset: `20231101.en`
- Split: `train`
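
The sample count above appears to follow from the values listed under Hyperparameters (`dataset_sample_size`, `dataset_test_size`, `train_batch_size`, `gradient_accumulation_steps`). The sketch below reproduces that arithmetic. It assumes a single training device and that the test split is carved out of the sampled rows; neither is stated in this card.

```python
import math

# Values copied from this card.
dataset_sample_size = 250_000
dataset_test_size = 0.01
train_batch_size = 4
gradient_accumulation_steps = 1
train_tokens = 145_744_973

# Assumption: the test split is taken out of the sampled rows.
test_samples = round(dataset_sample_size * dataset_test_size)
train_samples = dataset_sample_size - test_samples

# Assumption: one device, so one optimizer step consumes
# train_batch_size * gradient_accumulation_steps samples.
samples_per_step = train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_samples / samples_per_step)

mean_tokens_per_sample = train_tokens / train_samples

print(train_samples)    # 247500, the "Num Samples" value above
print(steps_per_epoch)  # 61875, the final step in the evaluation table
print(round(mean_tokens_per_sample, 2))
```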


# Training Objective

```
DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=kl, layer_mapper=layer-2))
```
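
Read literally, the objective is a weighted sum of two KL-divergence terms: one on the output logits (weight 1) and one on attention maps (weight 25.0). The following is a minimal pure-Python sketch of that shape, not Distily's implementation. The function names, the KL direction (teacher relative to student), and the averaging over attention rows are all assumptions, and `layer_mapper=layer-2` is not modeled here.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(teacher_logits, student_logits,
                      teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=25.0):
    """Weighted sum of a logits KL term and an attention KL term.

    teacher_attn / student_attn are lists of attention rows, where each
    row is already a probability distribution over key positions.
    Assumption: the attention term is the mean KL over rows.
    """
    logits_term = kl_divergence(softmax(teacher_logits),
                                softmax(student_logits))
    attn_term = sum(
        kl_divergence(t_row, s_row)
        for t_row, s_row in zip(teacher_attn, student_attn)
    ) / len(teacher_attn)
    return logits_weight * logits_term + attn_weight * attn_term


# Toy inputs, made up for illustration only.
loss = distillation_loss(
    teacher_logits=[2.0, 0.5, -1.0],
    student_logits=[1.0, 1.0, 0.0],
    teacher_attn=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    student_attn=[[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]],
)
print(loss)
```

With a weight of 25.0, a mismatch in attention distributions contributes far more to the total than an equally sized mismatch in the logits, which is the main design choice this objective string encodes.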

# Hyperparameters
The following hyperparameters were used during training:

<details>
<summary>Expand</summary>

- learning_rate: `0.0001`
- train_batch_size: `4`
- eval_batch_size: `8`
- seed: `42`
- optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
- lr_scheduler_type: `linear`
- lr_scheduler_warmup_ratio: `0.5`
- num_epochs: `1.0`
- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=kl, layer_mapper=layer-2))`
- train_embeddings: `True`
- lr_scheduler: `torch.optim.lr_scheduler.LambdaLR`
- student_model_name_or_path: `None`
- student_config_name_or_path: `None`
- student_model_config: `None`
- reinitialize_weights: `None`
- copy_teacher_modules: `[('lm_head', False)]`
- student_model_as_bitnet: `True`
- student_model_compile: `False`
- dropout: `None`
- teacher_model_name_or_path: `gpt2`
- teacher_load_in_8bit: `False`
- teacher_load_in_4bit: `False`
- teacher_model_compile: `False`
- dataset_uri: `wikimedia/wikipedia`
- dataset_subset: `20231101.en`
- dataset_split: `train`
- dataset_column_name: `text`
- dataset_sample_size: `250000`
- dataset_test_size: `0.01`
- gradient_accumulation_steps: `1`
- weight_decay: `0.0`
- max_grad_norm: `1.0`
- warmup_ratio: `0.5`
- warmup_steps: `0`
- gradient_checkpointing: `True`

</details>
<br/>
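
With `lr_scheduler_type: linear`, `warmup_ratio: 0.5`, and `warmup_steps: 0`, the learning rate should rise linearly to `0.0001` over roughly the first half of training and then decay linearly to zero. The sketch below illustrates that shape for the 61,875 steps shown in the evaluation table. It assumes the warmup length is the ceiling of `total_steps * warmup_ratio` and that the schedule matches the usual linear warmup and decay formula; `linear_schedule_lr` is a hypothetical helper, not a Distily or Transformers function.

```python
import math


def linear_schedule_lr(step, total_steps=61_875, peak_lr=1e-4,
                       warmup_ratio=0.5):
    """Learning rate at a given step for linear warmup then linear decay."""
    # Assumption: warmup length is derived from the ratio and rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 15_469, 30_938, 46_407, 61_875):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak learning rate is reached around step 30,938, which falls between the 30000 and 32500 rows of the evaluation table.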


# Framework Versions
- Distily 0.2.0
- Transformers 4.44.1
- Pytorch 2.5.0.dev20240821+cu121
- Datasets 2.21.0