Hardware recommendations?

#6
by k-nearest-neighbor - opened

What's the minimum and recommended VRAM?
I see the space is running on an A100 40GB.
Any guidance on tradeoffs of popular cards?

Excited!

Lightricks org

People have successfully run the model with 6GB of VRAM and 16GB of RAM with some tricks (quantized clip encoder, etc) and generating 512x512 resolution with 50 frames. This is the lowest VRAM requirement we've seen so far!
On an RTX 4090 users have generated 121 frames in 11 seconds, and on top-tier hardware (H100/fal.ai) it can generate a 512×768 video with 121 frames in just 4 seconds.

So the tradeoffs boil down to speed: lower-VRAM setups may need a reduced resolution or frame count and will generate more slowly, while higher-end hardware unlocks much faster generation and higher resolutions.
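To put the two settings quoted above side by side, here is a rough sketch that compares them by raw pixel-frame count. This is only a crude proxy: the assumption that memory and time grow with width x height x frames is mine, not a measured property of the model, and `workload` is a hypothetical helper name.

```python
def workload(width: int, height: int, frames: int) -> int:
    """Raw number of pixel-frames in one clip (a crude size proxy, not a VRAM figure)."""
    return width * height * frames


# Settings quoted in the reply above.
low_vram = workload(512, 512, 50)    # reported on a 6GB card
high_end = workload(512, 768, 121)   # reported on an RTX 4090 / H100

ratio = high_end / low_vram
print(f"low-VRAM clip:  {low_vram:,} pixel-frames")
print(f"high-end clip:  {high_end:,} pixel-frames")
print(f"ratio:          {ratio:.2f}x")
```

So the larger clip carries several times the raw data of the 6GB example, which gives a feel for how much has to be cut back on a small card.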

Thank you! Can you please tell me how to quantize it? My friend has 8 GB of VRAM and we wanted to run it, but we got a CUDA memory error.

If a model is above 8 GB, never expect it to run. Just use CogVideoX with CPU offload.

Try as I might, I can't get this to generate anything at all, regardless of the settings:

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 66.00 MiB is free. Of the allocated memory 23.39 GiB is allocated by PyTorch, and 355.11 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Could you please give me some settings or other guidance to work with, if this is indeed supposed to work with consumer cards?
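The allocator hint in the error message above can be applied from Python, as long as it happens before PyTorch is imported. A minimal sketch, assuming the variable name printed in the message (`PYTORCH_HIP_ALLOC_CONF`) plus its CUDA counterpart `PYTORCH_CUDA_ALLOC_CONF`; `enable_expandable_segments` is a hypothetical helper name, and the option may not be supported on every platform or PyTorch build.

```python
import os


def enable_expandable_segments(env=os.environ):
    """Set the allocator option named in the OOM message, keeping any existing value."""
    names = ("PYTORCH_HIP_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF")
    for name in names:
        # setdefault leaves a value the user has already configured untouched.
        env.setdefault(name, "expandable_segments:True")
    return {name: env[name] for name in names}


# Call this before `import torch`, so the allocator reads the setting at startup.
print(enable_expandable_segments())
```

This only reduces fragmentation. In the log above just 355.11 MiB is reserved but unallocated, so it may well not be enough on its own, and lowering resolution or frame count is probably still needed.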

Where's the best place to find the quantisations? @benibraz

8 GB VRAM RTX 4060 working. The default workflow eats up RAM and GPU memory. I am making a tutorial soon.

Where will you be posting your tutorial? thx

I would also like to see the tutorial, please! I use a 4090, but I sometimes struggle with "allocation on device" errors, and sometimes I even get bluescreens after a few generations.

The first generation always seems to go very fast, but for unknown reasons it sometimes stops and I have to restart my whole computer to get it running again. Restarting ComfyUI or Stability Matrix does not help at all.

In the Load CLIP node, use the fp16 version of the text encoder, t5xxl_fp16.safetensors, or even an fp8_e4m3fn version: t5xxl_fp8_e4m3fn.safetensors or t5xxl_fp8_e4m3fn_scaled.safetensors.
If you are using the Text Encoder node, I think you will also have to download clip_l.safetensors.
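To see why the precision of the text encoder matters so much, here is a back-of-the-envelope sketch of weight sizes by dtype. The parameter counts are assumptions: 2e9 is read off the "2b" in the model filename, and roughly 4.7e9 for the T5-XXL encoder is a commonly quoted round figure, not something measured here. Activations, the VAE and framework overhead are ignored.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

# Assumed, rounded parameter counts (see the note above).
ASSUMED_PARAMS = {
    "ltx-video-2b": 2.0e9,
    "t5xxl text encoder": 4.7e9,
}


def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for name, params in ASSUMED_PARAMS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_gib(params, dtype):.1f} GiB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name:20s} {sizes}")
```

Under these assumptions the text encoder is the larger of the two models, so dropping it from fp16 to fp8 halves its footprint and frees more memory than any other single change.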

Also give this workflow a try: example_ltxv_stg.json, which supports the newest (v0.9.1) model, ltx-video-2b-v0.9.1.safetensors.

In the VideoCombine output node on the right you'll find the version of the video file generated with STG support.

Still crashes? Try the workflow from this comfy tutorial for ltx and use the old v0.9 "mixed" precision model from Symphone/ltx-video-2b-v0.9-fp8. It should consume almost all of your RTX card's VRAM, and if you have at least 32 GB of system RAM it should run without any problems. (EDIT: I've now read that you've got a 4090, not a 4060, so I think reinstalling the CUDA 12.4 drivers and downloading a fresh release of ComfyUI portable should fix all the problems you've got.)

IMPORTANT

Right-click on the Desktop -> click "Show more options" in the context menu, then click "NVIDIA Control Panel" -> under "3D Settings" in the Tasks menu on the left side of the NVIDIA Control Panel, choose "Manage 3D Settings" -> in "Global Settings", find the setting "CUDA - Sysmem Fallback Policy" and choose the "Prefer Sysmem Fallback" option. Now you should be able to offload models into system RAM and say "goodbye" to crashes/freezes/bluescreens caused by OOM errors. (At least as long as you don't try to load, e.g., the full-precision (fp32) t5xxl text encoder model, or run any other script/application that also consumes VRAM.)

INSTALLATION

First, install CUDA 12.4, then download and unpack a fresh portable version of ComfyUI.

Download the ComfyUI-Manager installer script for portable Comfy from here: install-manager-for-portable-version.bat. Save it to the ComfyUI_portable* folder and run the script.

Then run Comfy with the run_nvidia_gpu.bat script -> drop the workflow onto the opened browser window running ComfyUI -> click the "MANAGER" button in the top right corner -> "Install Missing Custom Nodes" -> select the checkboxes of all missing nodes and click "install" at the bottom, then "restart". If you get a VideoHelperCombine missing warning after restarting ComfyUI, click "restart" again from the "MANAGER" panel without trying to fix/reinstall the nodes (I don't know why, but I got the same weird errors, and after the restart the VideoHelperCombine node works without needing a reinstall).

You can also try installing xformers and other attention processors into Comfy (e.g. if using the portable version) to achieve faster inference.
xformers with torch 2.5.1 support:

.\PATH_TO_COMFYUI_PORTABLE\python_embedded\python.exe -m pip install -U xformers==0.0.28.post3 --extra-index-url https://download.pytorch.org/whl/cu124

flash-attention 2:

.\PATH_TO_COMFYUI_PORTABLE\python_embedded\python.exe -m pip install https://huggingface.co/lldacing/flash-attention-windows-wheel/resolve/main/flash_attn-2.7.0.post2%2Bcu124torch2.5.1cxx11abiFALSE-cp312-cp312-win_amd64.whl

sageattention 2:

.\PATH_TO_COMFYUI_PORTABLE\python_embedded\python.exe -m pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.1.0-windows.post5/triton-3.1.0-cp312-cp312-win_amd64.whl

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
.\PATH_TO_COMFYUI_PORTABLE\python_embedded\python.exe setup.py install  # or pip install -e .
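
After running the install commands above, it can help to confirm which packages the portable Python actually sees. A small sketch using only the standard library; `available_backends` is a hypothetical helper name, and the import names listed (`xformers`, `flash_attn`, `sageattention`, `triton`) are the usual top-level module names for these packages. Run it with the same python.exe used for the installs.

```python
import importlib.util


def available_backends(names=("xformers", "flash_attn", "sageattention", "triton")):
    """Report which top-level modules can be found, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in available_backends().items():
    print(f"{name:15s} {'installed' if found else 'missing'}")
```

A package showing as installed only means it can be found. Whether ComfyUI actually uses it still depends on how it is launched and configured.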
