How much GPU and CPU memory do I need to fine-tune a 7B Raven model?
#21 by fubincom - opened
Hello, we have GPU clusters and want to continue training a 7B Raven model on our own in-domain corpus. I have 4 V100 GPUs with 32 GB of memory each, spread over 4 pods with 48 GB of CPU memory per pod (micro_bsz is 1, total batch size is 4). I have tried DeepSpeed stage 2 and stage 3 with offload, but the CPU memory is not sufficient. If I only fine-tune the model with LoRA (micro_bsz = 8), the resources are enough: it uses 23.2 GB of GPU memory and 19.2 GB of CPU memory per pod (trainable params: 20.3 MB, with DeepSpeed stage 2). I have set grad_cp = 1 and use fp16. A rough estimate of where the memory goes is sketched below.
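For context, here is a back-of-the-envelope estimate of why full fine-tuning runs out of memory on this setup. This is my own rough accounting, assuming standard mixed-precision Adam (about 16 bytes per parameter), not numbers taken from the repo:

```python
# Rough memory estimate for full fine-tuning of a 7B-parameter model with
# mixed-precision Adam (assumption: standard ZeRO-style accounting, ~16 bytes/param).
params = 7e9
gb = 1024 ** 3

fp16_weights  = params * 2 / gb   # ~13 GB, kept on GPU
fp16_grads    = params * 2 / gb   # ~13 GB
fp32_master   = params * 4 / gb   # ~26 GB \
adam_momentum = params * 4 / gb   # ~26 GB  } optimizer state, offloaded to CPU
adam_variance = params * 4 / gb   # ~26 GB /

optimizer_state = fp32_master + adam_momentum + adam_variance  # ~78 GB total
per_rank = optimizer_state / 4                                 # ~20 GB per pod if partitioned across 4 ranks

print(f"optimizer state total ~{optimizer_state:.0f} GB, per rank ~{per_rank:.0f} GB")
# On top of this, each pod also needs CPU RAM to load the checkpoint (~13 GB in fp16),
# plus pinned offload buffers and the dataloader, so 48 GB per pod is very tight.
```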
So I have several questions:
- Could you share the hardware and resources you used when you trained the 7B model?
- How can I set accumulate_grad_batches in the train.py script? I see it in my log file, but I can't find a way to set it (see the sketch after this list).
- I noticed one phenomenon when fine-tuning with LoRA (the same thing happens without LoRA when I train a 1B5 model): at the beginning of training it occupies only around 42 GB of CPU memory, and then the data gets split between CPU and GPU memory. Is that normal?
- What does grad_cp mean? Does it mean the gradients are recomputed instead of being kept in GPU memory (at the cost of roughly 30% more computation time), saving around 7 GB for the 7B model? (The 7 GB is my own estimate and may not be correct. See the checkpointing sketch below.)
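On accumulate_grad_batches: the name suggests the script builds a PyTorch Lightning Trainer, where accumulate_grad_batches is a standard constructor argument. If that is the case for this train.py (my assumption, I have not checked the exact version), a minimal sketch of passing it would look like this; the strategy string and device counts are placeholders for your setup, not values from the repo:

```python
# Hypothetical sketch, not the actual train.py: accumulate_grad_batches is a
# standard PyTorch Lightning Trainer argument, so if the script lets you touch
# its Trainer kwargs (or its argparse args), you can set it there.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                               # one GPU per pod in this example
    precision=16,                            # fp16 mixed precision
    strategy="deepspeed_stage_2_offload",    # ZeRO stage 2 with CPU offload
    accumulate_grad_batches=8,               # effective batch = micro_bsz * devices * 8
)
```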
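On grad_cp: it enables gradient (activation) checkpointing, i.e. intermediate activations are not kept for the backward pass but recomputed when needed, trading extra compute for a large activation-memory saving. A generic illustration with torch.utils.checkpoint (my own sketch, not the RWKV code):

```python
# Minimal gradient-checkpointing example: activations inside the checkpointed
# sub-module are discarded after the forward pass and recomputed during backward.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, use_checkpoint=True):
        if use_checkpoint:
            # Less GPU memory, more compute: self.ff is re-run in the backward pass.
            return x + checkpoint(self.ff, x, use_reentrant=False)
        return x + self.ff(x)

x = torch.randn(8, 256, 1024, requires_grad=True)
loss = Block()(x, use_checkpoint=True).sum()
loss.backward()   # triggers recomputation of the checkpointed activations
```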
Are you trying to fine-tune a model on your own documents? I am trying to do something similar: train my own model on my data to get a fine-tuned model I can query for chat completions.