instructions
Thank you for the great model!
I want to train a model like this, but a smaller one. I have the dataset prepared; could you please give me a few pointers on how to start the training? I am a complete noob rn.
My system:
RTX 3090
64 GB RAM
If your dataset is formatted as JSONL (one JSON object per line, newline separated) and has the keys "instruction" and "response", you should be able to fine-tune a model with either FastChat (full fine-tune) or qlora.
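For a quick sanity check of that format, something like this works (a minimal sketch; the filename instructions.jsonl is just an example, and the only assumption is the two keys above):

import json

# Verify every line is valid JSON and has the expected keys.
with open("instructions.jsonl") as infile:
    for line_number, line in enumerate(infile, start=1):
        row = json.loads(line)  # raises if the line isn't valid JSON
        missing = {"instruction", "response"} - row.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {missing}")
print("dataset format looks OK")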
In either approach, you'll first want to get the LLaMA 13B base model:
https://huggingface.co/decapoda-research/llama-13b-hf
Then, make sure you replace these two files in that base model:
special_tokens_map.json (replace it with: https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.1/resolve/main/special_tokens_map.json)
tokenizer_config.json (replace it with: https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.1/resolve/main/tokenizer_config.json)
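If you'd rather script the download and the file replacement than do it by hand, here's a rough sketch using the huggingface_hub library (this assumes a version recent enough to support local_dir; downloading the two files manually from the links above works just as well):

from huggingface_hub import hf_hub_download, snapshot_download

# Pull the base model into a local directory (adjust the path to taste).
base_dir = snapshot_download(
    repo_id="decapoda-research/llama-13b-hf",
    local_dir="/path/to/llama-13b-hf",
)

# Overwrite the two tokenizer files with the airoboros versions.
for filename in ("special_tokens_map.json", "tokenizer_config.json"):
    hf_hub_download(
        repo_id="jondurbin/airoboros-13b-gpt4-1.1",
        filename=filename,
        local_dir=base_dir,
    )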
qlora may be the easiest, and should theoretically work on a 3090. To do this, you'll want to use my fork of qlora here: https://github.com/jondurbin/qlora
Be sure to install the dependencies:
pip install -r ./qlora/requirements.txt
Then run something like:
export WANDB_API_KEY=[replace with your key]
export WANDB_PROJECT=[replace with your project name]
python qlora.py \
--model_name_or_path /path/to/llama-13b-hf \
--output_dir /path/to/output_dir \
--max_steps 900 \
--logging_steps 1 \
--save_strategy steps \
--data_seed 11422 \
--save_steps 25 \
--save_total_limit 5 \
--evaluation_strategy "no" \
--eval_dataset_size 2 \
--max_new_tokens 2048 \
--dataloader_num_workers 3 \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset /path/to/instructions.jsonl \
--dataset_format airoboros \
--model_max_len 2048 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--learning_rate 0.0001 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.05 \
--weight_decay 0.0 \
--seed 11422 \
--report_to wandb
You may be able to increase per_device_train_batch_size a bit depending on how much VRAM each pass takes, or you may need to drop it to 1; just experiment and check VRAM usage. Make sure you replace the various paths, and update the --max_steps value based on the number of instructions in your training data. I typically use 3 epochs, so you want steps = epochs * size of training data / (train batch size * gradient accumulation steps). So, say you had 500 instructions, you'd want steps = 3 * 500 / (2 * 16) ≈ 47.
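If you'd rather compute that in code than by hand, a tiny helper (plain Python, nothing project-specific assumed, with the defaults from the command above) looks like:

import math

def compute_max_steps(num_examples, epochs=3, per_device_batch_size=2, grad_accum_steps=16):
    # steps = epochs * dataset size / (per-device batch size * gradient accumulation), rounded up
    effective_batch_size = per_device_batch_size * grad_accum_steps
    return math.ceil(epochs * num_examples / effective_batch_size)

print(compute_max_steps(500))  # -> 47, matching the example above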
I don't know if the 3090 will have enough memory to do a full fine-tune of a 13B model, but you can try with my fork of the FastChat repo and flash attention, e.g.:
git clone https://github.com/jondurbin/FastChat && pip install ./FastChat && pip install flash_attn==1.0.5
FastChat expects the training data in conversation format, so you'd want to convert your instruction/response jsonl file via:
import json
import uuid

# Load the instruction/response pairs (one JSON object per line).
with open("instructions.jsonl") as infile:
    rows = [json.loads(line) for line in infile]

# Convert each pair into FastChat's conversation format.
conversations = []
for row in rows:
    conversations.append({
        "id": str(uuid.uuid4()),
        "conversations": [
            {
                "from": "human",
                "value": row["instruction"],
            },
            {
                "from": "gpt",
                "value": row["response"],
            },
        ],
    })

with open("as_conversations.json", "w") as outfile:
    json.dump(conversations, outfile, indent=2)
Then, after installing and getting your dataset prepped, run something like:
export WANDB_API_KEY=[replace with your key from wandb.ai, free to sign up]
export WANDB_PROJECT=[name of your project]
torchrun --nproc_per_node=1 --master_port=20001 ./FastChat/fastchat/train/train_mem.py \
--model_name_or_path /path/to/llama-13b-hf \
--data_path /path/to/as_conversations.json \
--output_dir /path/to/output_directory \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_steps 25 \
--save_total_limit 3 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.04 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap offload" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--bf16 True \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
Play around with per_device_train_batch_size and gradient accumulation steps.
Hi, and thanks for all this.
Is there a way to fine-tune other open-source models like MPT or Falcon?
I am making a training UI in my lollms-webui interface and want to make it as easy as possible: the user selects the base model they want to use, the database, and the training parameters, then presses train.
Also, does qlora run as a regular LoRA? Can we fuse the model with the newly generated LoRA delta? And can we restart training a model that has already been trained using qlora?
Thank you for all this
@ParisNeo
Here is a version using the QLoRA method:
https://colab.research.google.com/drive/1GrCda2sT0nE25fN-vOoMdpU-a93HQJg1?usp=sharing