Can you show a demo of quantizing this I2V model in 24GB of VRAM?
I've been able to run all the T2V Diffusers models, but not I2V. I'd like both BnB (bitsandbytes) and AO (torchao) examples. Thanks.
After some trial and error, I have been able to use this model in diffusers on a 3090 (24GB) by quantizing both the image_encoder and the transformer to 4 bits using bitsandbytes. The bnb settings for both were:
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
The trial and error came from getting the data types right; I was getting a mismatch between float and bf16. So make sure to load each component in the right dtype (see the sketch below):
- Load the vae in float32. I did not quantize the vae.
- Quantize the image_encoder as above, but load it in float32.
- Quantize the transformer as above, but load it in bfloat16.
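Here is a minimal end-to-end loading sketch of the above. It assumes the `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` checkpoint and a diffusers build with Wan support and bitsandbytes quantization; swap in whichever checkpoint you are actually using:

```python
import torch
from transformers import BitsAndBytesConfig as HFBitsAndBytesConfig, CLIPVisionModel
from diffusers import (
    AutoencoderKLWan,
    BitsAndBytesConfig as DiffusersBitsAndBytesConfig,
    WanImageToVideoPipeline,
    WanTransformer3DModel,
)

# Assumed checkpoint id; adjust to the one you are using.
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"

# The image_encoder is a transformers model, so it takes the transformers
# BitsAndBytesConfig. Quantize to 4-bit nf4, but load in float32.
image_encoder = CLIPVisionModel.from_pretrained(
    model_id,
    subfolder="image_encoder",
    quantization_config=HFBitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.float32,
)

# The transformer is a diffusers model, so it takes the diffusers
# BitsAndBytesConfig. Same 4-bit settings, but load in bfloat16.
transformer = WanTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)

# The vae is left unquantized and loaded in float32.
vae = AutoencoderKLWan.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.float32
)

pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    image_encoder=image_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
# Offload idle components to CPU to help stay under 24GB.
pipe.enable_model_cpu_offload()
```

With the 4-bit components and CPU offload this should fit on a 24GB card, though peak VRAM will vary with resolution and frame count.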
A 50-step, 81-frame image-to-video generation on the 3090 takes about 42 minutes with the above configuration.
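For completeness, a generation call matching that run might look like the following; the prompt, input image, resolution, and fps are placeholders I've assumed, not values from the post:

```python
from diffusers.utils import export_to_video, load_image

# Placeholder input image and prompt; substitute your own.
image = load_image("input.jpg")
prompt = "a short description of the motion you want"

frames = pipe(
    image=image,
    prompt=prompt,
    height=480,            # assumed 480P output
    width=832,
    num_frames=81,         # matches the 81-frame run above
    num_inference_steps=50,
).frames[0]

export_to_video(frames, "output.mp4", fps=16)
```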
Thanks, Don. Note that you should use the wan-sf branch of diffusers: `pip install git+https://github.com/huggingface/diffusers@wan-sf`. It works really well. Here is the output from my run:
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|████████████████████████| 14/14 [00:06<00:00,  2.23it/s]
Loading checkpoint shards: 100%|████████████████████████| 5/5 [00:00<00:00, 12.05it/s]
Loading pipeline components...: 100%|████████████████████████| 7/7 [00:00<00:00, 10.27it/s]
100%|████████████████████████| 50/50 [1:32:37<00:00, 111.15s/it]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
I hope the HF team can merge this branch into main as soon as possible.
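Since the original question also asked for an AO example: diffusers supports torchao through `TorchAoConfig` as well. Below is a minimal sketch for quantizing the transformer that way, assuming torchao is installed; the `int8wo` (int8 weight-only) setting is my assumption and was not tested in this thread:

```python
import torch
from diffusers import TorchAoConfig, WanTransformer3DModel

# Assumed checkpoint id, as in the bitsandbytes example above.
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"

# int8 weight-only quantization via torchao; other quant types
# such as "int4wo" are also available.
transformer = WanTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=TorchAoConfig("int8wo"),
    torch_dtype=torch.bfloat16,
)

# Build the pipeline exactly as in the bitsandbytes example,
# passing this transformer instead.
```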