Setup Notes
For this model, a VM with 2 NVIDIA T4 GPUs was used.
Note 1. The output directory was initially lora-alpaca; its contents were moved to a new folder when the git repository was initialized.
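Since the trained adapter weights end up in that output directory, below is a minimal sketch of loading them for inference with peft and transformers. It is an illustration, not the repo's own inference script: the prompt is simplified (alpaca-lora normally wraps inputs in its instruction template), and the generation settings are assumptions.

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"
ADAPTER_DIR = "./lora-alpaca"  # output_dir from the training run below

# Load the frozen base model, then attach the trained LoRA adapter on top.
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_DIR)
model.eval()

# Simplified prompt; the real run used alpaca-lora's instruction template.
prompt = "How many singers do we have?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```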
Log
(sqltest) chrisdono@deep-learning-duo-t4-3:~/alpaca-lora$ WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path 'spider' --output_dir './lora-alpaca' --num_epochs 10 --batch_size 32 --micro_batch_size 16 --learning_rate '9e-5' --add_eos_token
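A note on the batch arguments: --batch_size is the effective (global) batch size and --micro_batch_size is what each GPU processes per forward pass, with the gap bridged by gradient accumulation. A sketch of the arithmetic as it is typically done in upstream alpaca-lora's finetune.py (variable names assumed from that repo):

```python
import os

batch_size = 32        # --batch_size: effective global batch size
micro_batch_size = 16  # --micro_batch_size: per-GPU batch per forward pass

# Accumulate micro batches until the effective batch size is reached.
gradient_accumulation_steps = batch_size // micro_batch_size  # 32 // 16 = 2

# Under DDP (WORLD_SIZE=2 here), the ranks contribute micro batches in
# parallel, so the accumulation steps are split across ranks.
world_size = int(os.environ.get("WORLD_SIZE", 1))
if world_size > 1:
    gradient_accumulation_steps //= world_size  # 2 // 2 = 1
```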
Adding the last loss values, which were not included in the trainer JSON file from the last checkpoint:
{'loss': 0.241, 'learning_rate': 1.0040816326530613e-05, 'epoch': 8.98}
{'loss': 0.2343, 'learning_rate': 9.42857142857143e-06, 'epoch': 9.04}
{'loss': 0.2376, 'learning_rate': 8.816326530612245e-06, 'epoch': 9.11}
{'loss': 0.2355, 'learning_rate': 8.204081632653062e-06, 'epoch': 9.17}
{'loss': 0.229, 'learning_rate': 7.591836734693877e-06, 'epoch': 9.24}
{'loss': 0.2325, 'learning_rate': 6.979591836734694e-06, 'epoch': 9.3}
{'loss': 0.24, 'learning_rate': 6.367346938775511e-06, 'epoch': 9.36}
{'loss': 0.2438, 'learning_rate': 5.755102040816327e-06, 'epoch': 9.43}
{'loss': 0.2391, 'learning_rate': 5.142857142857143e-06, 'epoch': 9.49}
{'loss': 0.2351, 'learning_rate': 4.530612244897959e-06, 'epoch': 9.55}
{'loss': 0.2289, 'learning_rate': 3.9183673469387755e-06, 'epoch': 9.62}
{'loss': 0.2294, 'learning_rate': 3.3061224489795924e-06, 'epoch': 9.68}
{'loss': 0.2344, 'learning_rate': 2.693877551020408e-06, 'epoch': 9.75}
{'loss': 0.2358, 'learning_rate': 2.0816326530612247e-06, 'epoch': 9.81}
{'loss': 0.2365, 'learning_rate': 1.469387755102041e-06, 'epoch': 9.87}
{'loss': 0.2309, 'learning_rate': 8.571428571428572e-07, 'epoch': 9.94}
{'loss': 0.2438, 'learning_rate': 2.4489795918367347e-07, 'epoch': 10.0}
{'train_runtime': 17144.6766, 'train_samples_per_second': 2.916, 'train_steps_per_second': 0.092, 'train_loss': 0.41175747267000234, 'epoch': 10.0}
100%|██████████| 1570/1570 [4:45:44<00:00, 10.92s/it]
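Sanity check on the totals: 1570 steps at roughly 10.92 s/step gives 1570 × 10.92 ≈ 17,144 s (about 4 h 45 m), which matches the reported train_runtime of 17144.68 s.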