Llama-3-15B-Instruct-zeroed-ft

This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.

The model is based on Llama-3-15B-Instruct-zeroed, a "zeroed" passthrough merge.

This was primarily an experiment to see how a passthrough merge responds to further finetuning, albeit on a small dataset.

The model was finetuned at a context length of 8192 tokens and is likely reliable up to 32k using RoPE scaling.
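As a rough illustration of the context extension mentioned above, the sketch below computes the scaling factor needed to stretch 8192 tokens to 32k. It assumes that "32k" means 32768 tokens and that linear RoPE scaling is used; the scaling type is not stated in this card, and the dictionary shape follows the `rope_scaling` field of the Transformers Llama configuration from around this release. Treat it as a starting point to test, not a verified setting.

```python
def rope_scaling_factor(trained_ctx: int, target_ctx: int) -> float:
    """Ratio by which positions must be compressed to fit target_ctx."""
    if target_ctx <= trained_ctx:
        return 1.0  # no scaling needed inside the trained window
    return target_ctx / trained_ctx


TRAINED_CTX = 8192    # context length used for finetuning (from this card)
TARGET_CTX = 32768    # assumption: "32k" read as 32768 tokens

factor = rope_scaling_factor(TRAINED_CTX, TARGET_CTX)

# Assumption: linear scaling. "dynamic" is another type accepted by
# Llama configs in Transformers of this era.
rope_scaling = {"type": "linear", "factor": factor}
print(rope_scaling)
```

The resulting dictionary would typically be passed as `rope_scaling` when loading the model config; whether linear or dynamic scaling holds up better at 32k for this model has not been measured here.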

Further finetuning this model or finetuning the base model on more samples is encouraged.

Datasets

Chat-Error/Pure-dove-sharegpt

A small, high-quality dataset was used as a PoC / validation of stabilizing the model after finetuning.

Finetuning details

The model was trained with QLoRA, targeting the following modules:

lora_target_modules:
  - down_proj
  - o_proj
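To make the effect of this target list concrete, here is a minimal sketch that lists which linear modules of a Llama-style decoder would receive adapters. The module paths follow the standard Llama naming in Transformers; the layer count of 2 and the helper names are hypothetical and chosen only to keep the output short, and the suffix match is a simplification of how adapter libraries select modules.

```python
TARGET_MODULES = ["down_proj", "o_proj"]  # from the config above


def llama_linear_names(num_layers: int) -> list[str]:
    """Enumerate the linear projection names of a Llama-style decoder."""
    names = []
    for i in range(num_layers):
        for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
            names.append(f"model.layers.{i}.self_attn.{proj}")
        for proj in ("gate_proj", "up_proj", "down_proj"):
            names.append(f"model.layers.{i}.mlp.{proj}")
    return names


def select_targets(names: list[str], targets: list[str]) -> list[str]:
    """Keep modules whose final path component is one of the targets."""
    return [n for n in names if n.split(".")[-1] in targets]


# Hypothetical 2-layer model, for illustration only.
selected = select_targets(llama_linear_names(2), TARGET_MODULES)
for name in selected:
    print(name)
```

Only the attention output projection and the MLP down projection of each layer are selected, so 2 of the 7 linear modules per layer carry trainable adapters.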

The model is coherent and writes well even after training the "zeroed" layers. In the next experiment, all layers will be finetuned, as recommended by Charles Goddard. Thank you to Charles for sharing the merging method, and to Toasty Pigeon for bringing it to my attention!

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 6
- total_eval_batch_size: 6
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
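The batch size and scheduler entries above can be sketched as follows. This assumes no gradient accumulation (consistent with 2 x 3 = 6) and the common shape of linear warmup followed by cosine decay to zero; the step count of 579 is taken from `train/global_step` in the W&B summary, and the function name is hypothetical. The scheduler in the actual training stack may differ in small details.

```python
import math

TRAIN_BATCH_SIZE = 2   # per device
NUM_DEVICES = 3
PEAK_LR = 1e-5
WARMUP_STEPS = 25
TOTAL_STEPS = 579      # train/global_step from the W&B summary

# Effective batch size, assuming gradient accumulation of 1.
total_train_batch_size = TRAIN_BATCH_SIZE * NUM_DEVICES


def cosine_lr(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 302, 579):
    print(step, cosine_lr(step))
```

Under these assumptions the rate reaches its peak at step 25 and decays to zero at the final step, which matches the `train/learning_rate` of 0.0 reported at the end of the run.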

The paged_adamw_8bit optimizer and DeepSpeed ZeRO 3 were used at a learning rate of 1e-5 with the cosine scheduler for 1 epoch on 3x 3090s, taking 2h 30m in total.

Sample packing and padding were disabled, which significantly reduces VRAM consumption at the cost of speed.

W&B Run Summary

wandb: Run summary:
wandb:                eval/loss 0.94497
wandb:             eval/runtime 276.2864
wandb:  eval/samples_per_second 1.397
wandb:    eval/steps_per_second 0.235
wandb:               total_flos 12246605365248.0
wandb:              train/epoch 1.0
wandb:        train/global_step 579
wandb:          train/grad_norm 0.80411
wandb:      train/learning_rate 0.0
wandb:               train/loss 1.085
wandb:               train_loss 0.8834
wandb:            train_runtime 9893.1688
wandb: train_samples_per_second 0.351
wandb:   train_steps_per_second 0.059
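The throughput figures in the summary can be cross-checked against the step count and batch size with a few lines of arithmetic. The sketch below assumes every step processed a full batch of 6 samples, so the derived sample counts are estimates rather than exact dataset sizes.

```python
# Values copied from the W&B run summary above.
train_runtime = 9893.1688        # seconds
global_step = 579
eval_runtime = 276.2864          # seconds
eval_samples_per_second = 1.397
total_train_batch_size = 6       # from the hyperparameter list

# Estimated number of training samples seen in the single epoch,
# assuming every step used a full batch.
train_samples = global_step * total_train_batch_size

# Recompute throughput from the raw counts.
train_steps_per_second = global_step / train_runtime
train_samples_per_second = train_samples / train_runtime

# Estimated size of the evaluation split.
eval_samples = round(eval_samples_per_second * eval_runtime)

print(train_samples, eval_samples)
print(round(train_steps_per_second, 3), round(train_samples_per_second, 3))
```

The recomputed rates round to the reported 0.059 steps and 0.351 samples per second, so the summary is internally consistent with a total batch size of 6.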

Framework versions

PEFT 0.10.0
Transformers 4.40.0.dev0
Pytorch 2.3.0+cu121
Datasets 2.15.0
Tokenizers 0.15.0

Model Evaluation

TBD

If you have any questions or comments on the model, feel free to open a discussion in the community tab.
