How much VRAM does it take to run Falcon 40B?
How much VRAM (or system RAM/swap) does it take to run Falcon 40B? Anyone got any ideas?
157G
? What cards? And is that VRAM or RAM?
Depends on whether you want to do inference in 32-, 16-, 8-, or 4-bit, but at full 32-bit I think it's about 80GB of VRAM.
Correction: 16-bit is ~80GB and 32-bit would be around ~160GB, I believe. Thanks Mikael110, I was thinking about 16-bit and not 32-bit when I wrote this.
With 8-bit loading it consumes ~46GB of VRAM, and with 4-bit loading it takes ~24GB of VRAM. Those numbers exclude OS headroom, so don't expect 4-bit to fit on actual 24GB cards, and 8-bit will be a tight squeeze on 48GB cards; you will probably OOM once the context gets even remotely long. I can't give numbers for 16-bit and 32-bit since they OOM on the A100 80GB I was testing on, but given that even 16-bit is too big for that card, I'm quite confident that 32-bit is quite a bit larger than 80GB. Maybe that's the number leoapolonio was referencing? I could definitely see it being that high for full 32-bit inference.
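For anyone who wants to try the 8-bit / 4-bit setups mentioned above, here's a rough sketch of how quantized loading can be done with transformers + bitsandbytes. The `tiiuae/falcon-40b` hub id is the official one; the specific flags, compute dtype, and prompt are just illustrative assumptions, not a definitive recipe:

```python
# Minimal sketch: load Falcon-40B quantized with bitsandbytes.
# Assumes transformers, accelerate, and bitsandbytes are installed
# and that you have enough GPU memory (~24GB+ for 4-bit, ~46GB+ for 8-bit).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"

# 4-bit config; for 8-bit use BitsAndBytesConfig(load_in_8bit=True) instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",       # spread layers across available GPUs / CPU
    trust_remote_code=True,  # Falcon shipped custom modelling code at release
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that the actual memory use will grow with context length because of the KV cache, which is why the "tight squeeze" warning above matters.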
THANK YOU!
Hi, the model is trained in bfloat16, not float32, so you need 40B params × 2 bytes per param ≈ 80GB to run it.
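If it helps, here is the same back-of-the-envelope math extended to the other precisions people mentioned. This is weights only; it ignores the KV cache, activations, and framework overhead, which is why the measured 8-bit/4-bit numbers above come out higher:

```python
# Rough weight-only memory estimate for a ~40B-parameter model.
# Real usage is higher: KV cache, activations, and framework overhead add on top.
params = 40e9
bytes_per_param = {"fp32": 4.0, "bf16/fp16": 2.0, "int8": 1.0, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype:>9}: ~{params * nbytes / 1e9:.0f} GB")
# fp32: ~160 GB, bf16/fp16: ~80 GB, int8: ~40 GB, int4: ~20 GB
```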
We recommend 80-100GB to run inference on Falcon-40B comfortably.