End of training

Files changed (8)

README.md +6 -6
logs/attn_norm=None, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/completed.flag +0 -0
logs/attn_norm=rmsnorm, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725154305.cfb07cadeb51 +3 -0
logs/attn_norm=rmsnorm, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725154736.cfb07cadeb51 +3 -0
logs/attn_norm=rmsnorm, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725162526.cfb07cadeb51 +3 -0
logs/attn_norm=rmsnorm_teacher_only, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725154495.cfb07cadeb51 +3 -0
model.safetensors +1 -1
training_args.bin +1 -1

README.md CHANGED

@@ -44,7 +44,7 @@ More information needed
 # Resource Usage Comparison
-- VRAM Use: 8.0719 GB
 # Distillation (Teacher -> Student) Architecture Difference:
@@ -75,7 +75,7 @@ More information needed
 <br/>
 # Train Dataset
-Trained on 226,120,971 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 - Num Samples: `396,000`
 - Subset: `20231101.en`
@@ -85,7 +85,7 @@ Trained on 226,120,971 tokens from the [wikimedia/wikipedia](https://huggingface
 # Training Objective
 ```
-DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, projector=orthogonal))
 ```
 # Hyperparameters
@@ -101,9 +101,9 @@ The following hyperparameters were used during training:
 - optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
 - lr_scheduler_type: `polynomial`
 - num_epochs: `1.0`
-- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, projector=orthogonal))`
 - train_embeddings: `True`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f69141039a0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `distilbert/distilgpt2`
 - student_model_config: `None`
@@ -134,5 +134,5 @@ The following hyperparameters were used during training:
 # Framework Versions
 - Distily 0.4.1
 - Transformers 4.44.2
-- Pytorch 2.3.0
 - Datasets 2.21.0

 # Resource Usage Comparison
+- VRAM Use: 8.0694 GB
 # Distillation (Teacher -> Student) Architecture Difference:
 <br/>
 # Train Dataset
+Trained on 226,054,936 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 - Num Samples: `396,000`
 - Subset: `20231101.en`
 # Training Objective
 ```
+DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, norm=rmsnorm, projector=mlp))
 ```
 # Hyperparameters
 - optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
 - lr_scheduler_type: `polynomial`
 - num_epochs: `1.0`
+- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, norm=rmsnorm, projector=mlp))`
 - train_embeddings: `True`
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f1cc52d42e0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `distilbert/distilgpt2`
 - student_model_config: `None`
 # Framework Versions
 - Distily 0.4.1
 - Transformers 4.44.2
+- Pytorch 2.4.0+cu121
 - Datasets 2.21.0
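
As a reading aid for the updated `distillation_objective` above: it pairs a KL logits loss (weight 1) with an attention loss (`raw_mse`, weight 5) that now applies `norm=rmsnorm`. The sketch below is a minimal, dependency-free illustration of how such a weighted sum might be computed; it is not Distily's implementation. The function names, the KL direction (teacher to student), the `eps` value, and the omission of the `layer-2` layer mapper and the `mlp` projector are all assumptions made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    # KL(teacher || student) for one distribution (direction is an assumption).
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rms_norm(xs, eps=1e-6):
    # Divide by the root mean square of the vector (eps is a hypothetical default).
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms for x in xs]


def mse(a, b):
    # Plain mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(t_logits, s_logits, t_attn, s_attn,
                      logits_weight=1.0, attn_weight=5.0):
    # Weighted sum mirroring weight=1 (logits) and weight=5 (attn) in the diff.
    # The layer mapper and MLP projector are intentionally left out of this sketch.
    logits_loss = kl_divergence(t_logits, s_logits)
    attn_loss = mse(rms_norm(t_attn), rms_norm(s_attn))
    return logits_weight * logits_loss + attn_weight * attn_loss
```

When teacher and student inputs are identical, both terms vanish and the combined loss is zero; any mismatch in the logits makes the KL term strictly positive.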

logs/attn_norm=None, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/completed.flag ADDED

File without changes

logs/attn_norm=rmsnorm, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725154305.cfb07cadeb51 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d3831486e4cb0020005b7d215cb3358a01b97005b4a1bef331b81b8ae1c7d088
+size 5766

logs/attn_norm=rmsnorm, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725154736.cfb07cadeb51 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:acf4e787da81e1e48db1c2ce1f56be96deaeac051d0f66fc10a18b9466812107
+size 23671646

logs/attn_norm=rmsnorm, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725162526.cfb07cadeb51 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b9af248814af090970a257266cbc347cf08c4edb5048de500d410ce50c3bb488
+size 529

logs/attn_norm=rmsnorm_teacher_only, attn_projector=mlp, attn_weight=5, learning_rate=0.0002, per_device_train_batch_size=8, warmup_ratio=0/events.out.tfevents.1725154495.cfb07cadeb51 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:df45a92d91c6708deedc473128ae10036c2221f268ada432b30d431fe9a561bc
+size 5792

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:45d62c432f775cbdda3a750d07efe0d0529b4c05a031b4c6a84529b68eb1f8b1
 size 163832792

 version https://git-lfs.github.com/spec/v1
+oid sha256:ca42f343a8e34e60898e35f71158ea4da974f70f0dbc2628cba711bd0c217935
 size 163832792

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:14f714d43a749f120e9125fa032bbc3ef73d61dcb69b9c461f2cc0669f9d006d
 size 5560

 version https://git-lfs.github.com/spec/v1
+oid sha256:adfb1e2de2a99c1a9be060779fe5affd872685504452e48668f836e2df571788
 size 5560
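
The binary files in this commit appear only as Git LFS pointer files: three `key value` lines giving the spec version, the object id, and the size in bytes. A minimal sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` and the returned dictionary layout are hypothetical, chosen for illustration only.

```python
def parse_lfs_pointer(text):
    # Each pointer line is "<key> <value>"; split on the first space only.
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


# The new model.safetensors pointer from this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ca42f343a8e34e60898e35f71158ea4da974f70f0dbc2628cba711bd0c217935
size 163832792
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

Only the `oid` changes for `model.safetensors` and `training_args.bin` in this commit; the unchanged `size` fields show the new files have the same byte length as the ones they replace.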