What are the rough VRAM requirements for this model?

#49
opened by Kelmeilia

I tried to run the model on my GPU with 20 GB of VRAM and it seems to take forever... I am not very good at math, but it looks like the 70B model needs a really large amount of memory to run.
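For reference, this is the back-of-the-envelope math I was trying to do (just a rough sketch, assuming about 2 bytes per parameter in fp16 and about 0.5 bytes per parameter at 4-bit, not counting activations or the KV cache):

```python
# Rough VRAM estimate for a 70B-parameter model (weights only).
params = 70e9

fp16_gib = params * 2.0 / 1024**3   # ~130 GiB at 2 bytes/param (fp16/bf16)
int4_gib = params * 0.5 / 1024**3   # ~33 GiB at ~0.5 bytes/param (4-bit)

print(f"fp16 weights:  ~{fp16_gib:.0f} GiB")
print(f"4-bit weights: ~{int4_gib:.0f} GiB")
```

So even at 4-bit the weights alone seem to be well above my 20 GB, if I am doing this right.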

My GPU memory ran out even when using bitsandbytes with load_in_4bit, and when I offload to CPU, a single inference takes more than 30 minutes.
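For context, this is roughly how I am loading it (a minimal sketch; the model ID and memory limits are placeholders, not necessarily what anyone else should use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model ID -- substitute the actual 70B checkpoint.
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

# 4-bit NF4 quantization via bitsandbytes, computing in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate spill layers to CPU RAM when the GPU fills up;
# max_memory caps GPU usage so there is some headroom left for activations.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "18GiB", "cpu": "64GiB"},
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```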

However, judging from the popularity of the model, I suspect many people have been using it successfully.

I must be missing something. How can I run this locally with a GPU that isn't top of the line, or is it impossible?

I don't know if this is pertinent, but I had some strange problems with LLaMa 3.2 too: with Ollama on a Linux VM it runs like magic, but when I try it with Hugging Face Transformers from a Python script in my local Windows environment, it is REALLY slow. I thought it had something to do with hyperparameters, but I am not sure anymore.
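In case it helps, this is the kind of quick sanity check I can run on the Windows side to see whether the script is even using the GPU (just a check, not my full script):

```python
import torch

# If this prints False, the Transformers script is running entirely on CPU,
# which could explain the huge speed difference compared to Ollama.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info(0)
    print(f"Free/total VRAM: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
```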
