base_model:
  - elinas/Llama-3-15B-Instruct-zeroed
library_name: transformers
tags:
  - mergekit
  - merge
license: llama3

Llama-3-15B-Instruct-ft-zeroed

This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.

The model is based on a "zeroed" passthrough merge, Llama-3-15B-Instruct-zeroed.

This was primarily an experiment to see how a passthrough merge will respond to further finetuning, though this was done on a small dataset.

The goal was to make a "mid"-sized model, like those Meta has released in the past. The merge method was inspired by mlabonne's Llama-3-120B.

The model was finetuned at a context length of 8192 tokens and is likely reliable with RoPE scaling up to 32k.
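As a rough illustration of what stretching the 8192-token training length to 32k involves, the sketch below computes the scaling factor. It assumes "32k" means 32768 tokens. The `rope_scaling`-style dictionary is a hypothetical setting: this card does not say which scaling method was in mind, so the `"dynamic"` value is only a placeholder.

```python
# Sketch: scaling factor needed to stretch the finetuning context to 32k.
TRAINED_CONTEXT = 8192   # context length used for finetuning (from this card)
TARGET_CONTEXT = 32768   # assumption: "32k" read as 32 * 1024 tokens


def rope_scaling_factor(trained: int, target: int) -> float:
    """Return how many times longer the target context is than the trained one."""
    if trained <= 0 or target < trained:
        raise ValueError("trained must be positive and target must be >= trained")
    return target / trained


factor = rope_scaling_factor(TRAINED_CONTEXT, TARGET_CONTEXT)

# Hypothetical config fragment; the scaling type is an assumption, not from the card.
rope_scaling = {"type": "dynamic", "factor": factor}
print(rope_scaling)
```

Whether quality actually holds at that factor is untested here, so long-context output should be checked before relying on it.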

Further finetuning this model or finetuning the base model on more samples is encouraged.

Datasets

A small, high-quality dataset was used as a proof of concept (PoC) to validate that the model stabilizes after finetuning.

Finetuning details

This is a QLoRA model, and the following modules were targeted:

lora_target_modules:
  - down_proj
  - o_proj
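To make the targeting concrete, the sketch below lists the linear projections in a standard Llama decoder layer and marks which ones match the two names above. Matching on the last component of the module path is a simplified stand-in for how adapter libraries typically resolve target names, and the helper `is_targeted` is hypothetical.

```python
# Linear projections in one Llama decoder layer (standard module names).
LLAMA_LAYER_LINEARS = [
    "self_attn.q_proj",
    "self_attn.k_proj",
    "self_attn.v_proj",
    "self_attn.o_proj",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.down_proj",
]

# Target list copied from the config above.
LORA_TARGET_MODULES = ["down_proj", "o_proj"]


def is_targeted(module_name: str, targets: list[str]) -> bool:
    """Hypothetical helper: match on the last component of the module path."""
    return module_name.split(".")[-1] in targets


targeted = [m for m in LLAMA_LAYER_LINEARS if is_targeted(m, LORA_TARGET_MODULES)]
print(targeted)
```

With this list, adapters attach to the attention output projection and the MLP down projection of each layer, and the remaining five projections stay frozen.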

The model is coherent even after training the "zeroed" layers and can write well. In the next experiment, all layers will be finetuned, as recommended by Charles Goddard. Thank you for the merging method!

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 6
- total_eval_batch_size: 6
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
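The values above determine the effective batch size and the shape of the learning-rate curve. The sketch below reproduces both under stated assumptions: gradient accumulation is taken to be 1 (implied by 2 x 3 = 6), the total step count of 579 is taken from the run summary, and `lr_at` is a hypothetical helper that follows the usual linear-warmup-then-cosine form rather than the exact scheduler code used in the run.

```python
import math

LEARNING_RATE = 1e-5
WARMUP_STEPS = 25
TOTAL_STEPS = 579       # train/global_step from the run summary
PER_DEVICE_BATCH = 2
NUM_DEVICES = 3
GRAD_ACCUM_STEPS = 1    # assumption: not listed, implied by 2 * 3 = 6

# Effective number of samples consumed per optimizer step.
total_train_batch_size = PER_DEVICE_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS


def lr_at(step: int) -> float:
    """Hypothetical helper: linear warmup followed by cosine decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size)
for step in (0, 25, 302, 579):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 1e-05 once warmup ends at step 25 and decays to zero at the final step, which agrees with the `train/learning_rate 0.0` entry logged at the end of the run.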

The paged_adamw_8bit optimizer and DeepSpeed ZeRO 3 were used at a learning rate of 1e-5 with the cosine scheduler for 1 epoch on 3x 3090s, taking 4h 12m 13s in total.

Sample packing and padding were disabled, which significantly reduced VRAM consumption at the cost of speed.

W&B Run Summary

wandb: Run summary:
wandb:                eval/loss 0.94497
wandb:             eval/runtime 276.2864
wandb:  eval/samples_per_second 1.397
wandb:    eval/steps_per_second 0.235
wandb:               total_flos 12246605365248.0
wandb:              train/epoch 1.0
wandb:        train/global_step 579
wandb:          train/grad_norm 0.80411
wandb:      train/learning_rate 0.0
wandb:               train/loss 1.085
wandb:               train_loss 0.8834
wandb:            train_runtime 9893.1688
wandb: train_samples_per_second 0.351
wandb:   train_steps_per_second 0.059
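The throughput figures in the summary can be cross-checked against the step count. The sketch below does that arithmetic, assuming the total train batch size of 6 from the hyperparameter list and that every step consumed a full batch. The variable names are illustrative only.

```python
# Values copied from the run summary and hyperparameter list.
TRAIN_RUNTIME_S = 9893.1688
TRAIN_SAMPLES_PER_S = 0.351
GLOBAL_STEP = 579
TOTAL_TRAIN_BATCH = 6   # assumption: every step used a full batch

# Two independent estimates of how many training samples were seen.
samples_from_throughput = TRAIN_SAMPLES_PER_S * TRAIN_RUNTIME_S
samples_from_steps = GLOBAL_STEP * TOTAL_TRAIN_BATCH

# Steps per second recomputed from the step count and runtime.
steps_per_s = GLOBAL_STEP / TRAIN_RUNTIME_S

relative_gap = abs(samples_from_throughput - samples_from_steps) / samples_from_steps

print(round(samples_from_throughput), samples_from_steps)
print(round(steps_per_s, 3))
print(f"{relative_gap:.4%}")
```

Both estimates land near 3,470 samples, and the recomputed step rate rounds to the logged `train_steps_per_second`, so the summary is internally consistent for a dataset of roughly that size.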

Framework versions

  • PEFT 0.10.0
  • Transformers 4.40.0.dev0
  • PyTorch 2.3.0+cu121
  • Datasets 2.15.0
  • Tokenizers 0.15.0

Model Evaluation

TBD

If you have any questions or comments on the model, feel free to open a discussion in the community tab.

Built with Axolotl