EDITED: No H100 needed, very easy model to run across all chips.
EDIT: title change - there isn't any issue just bad documentation on vram alt chip usage.
TLDR thread update: No issue, the model is just not nearly as heavy as the company advertises. You do not need an H100 for anything here - it's a total waste of GPU/time/money. Larger custom SDXL models put more strain on the GPU than this does so far. If the frame size goes to 1080p then maybe, but for now, with no insult towards anyone at all, this can easily be run on anything that has 24GB of VRAM or less. Thanks to the discoveries in this thread I actually have it running on a dinosaur GTX 1070 with 8GB of VRAM, so the front page docs should be updated accordingly.
For the sample script and what to remove, see my 2nd to last post and the screenshot on its right side. It highlights the CUDA core usage with no switches enabled.
THREAD CLOSED - THANKS FOR THE CLARIFICATIONS TO THE SETTINGS EVERYONE! =)
//////
ORIGINAL POST:
I have it working in comfyui on a VM remote server, 8x xeon, 76gb ram, L40 gpu (48gb vram). SSD yada yada standard server stuff.
I've attached a few pics. I'm probably doing something very wrong as this feels...too easy? For the machine?! Compared to the other AI models I run, the GPU is handling it like butter, so much so that I think something is wrong, since your docs here on the site state you need 21+ GB of VRAM if not more, like H100s.
I thought I would try it on my L40 before loading everything on the H100 -> seems I don't even need to bother?! Clearly I'm doing something wrong, this can't be right. Thanks everyone for chiming in. The model is 9+ GB, the CLIP is 4+ GB and the VAE adds another layer on top, so there is no way it's only using 2.7-4.8 GB of VRAM here. Is it swap madness or really, really efficient?! (SD has switches to split the parts of the model and round-robin them as needed.) Is this doing the same?
Help me max this bad boy =) LOL
Vids take roughly 565 seconds (9ish minutes) as is. It's not slouching in the least, the CUDA cores are maxed - but wth - load it all into VRAM here! You have lots lol
Edit: or am I being childish and this box is a beast it just takes it like baby carrots?! am i out of touch?!
I believe these are the settings you are asking about.
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
In the earliest implementations of the model, the built-in optimizations in the diffusers library were not compatible. However, with the subsequent upgrades of the diffusers version, these built-in optimizations could be applied to the CogVideoX model. Of course, these optimizations can be turned off, which would load the entire model into VRAM, significantly speeding up the process.
hmm that's interesting yes I believe this was in the docs.
Seems ComfyUI doesn't obviously take advantage of this. Oddly, I very much enjoy a GUI for everything, but for this text to video I see no point?! Needing 45 nodes to make it work complicates things for nothing, I find.
I will try hard to make a clean VM on Ubuntu to try and compile this directly, using the above as well, to abuse the entire GPU.
It's cake sprinkling now <_< bit irritating LOL. I wanna max this L40 and make it sweat a bit. thanks for the tips. Have a cute gen'd frog video haha
So I made a new VM, and since L40s are expensive I grabbed my trusty A6000 GPU, as they are brothers and if I can't max one then I can't max the other.
Fresh updated torch, the lot -> manually compile the DIR, run a prompt from script = working. (non-ComfyUI)
I went to mess with the posted
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
And literally everything became worse. Like 2x/3x as slow, 28 minutes to generate nonsense. Hard no here. Also, while playing with the settings and trying to enable just one and then another, the most VRAM usage I could muster was a 16GB load out of the 48GB total. So something clearly is wrong here, as this is a totally different server. Way better than it using only 4.4GB of VRAM like above, but not even close to its capacity.
How is it I can't max my vram with this model? I can murder the gpu by trying to load a larger model like the mixtral22b for instance.
I'm clearly doing something wrong and can get more out of this for sure, thank you very much for the assistance anyone!
Well, this is the H100 gpu. Still cannot get it past using 16gb vram. Ever.
The best combo so far is the 3 switches > gives me 1x video per 3 minutes. Literally mind-bogglingly fast generation speeds. Much appreciated on that front, chaps!
But only using 16GB of VRAM? Is this normal?! If yes...I'm quite stunned. But clearly I'm an idiot and not doing something right - again. You guys are all using H100s as well, correct? Are yours maxed?
How? what config are you using? Thanks!
Hello! You should exclude lines 12-14 above for full speed and full memory use on an H100.
pipe.enable_sequential_cpu_offload()
slows things down the most because each layer is moved onto the GPU one at a time to process the step and then removed from the GPU.
pipe.enable_model_cpu_offload()
will shuffle each model (e.g. T5 encoder, transformer, VAE) onto the GPU as it's needed. Not as slow as sequential offload, but still a bit slower than having all components on the GPU at one time.
The pipe.vae adjustments don't impact speed that much. With both enabled a 4090 was able to complete the VAE decode step in 7 seconds with 49 frames.
I have this running on 2x 4090s with the transformer on one 4090 and the rest on the second 4090. Transformer processing took 360 seconds, rest of the steps (prompt encode, vae decode) took less than 10 seconds.
To recap, do not include any of these for full inference speed on an H100 or A100 80GB:
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
The Diffusers documentation goes into more depth on what each one does: https://huggingface.co/docs/diffusers/en/optimization/memory
I was looking for more info on these settings thank you Astra!
As for disabling all of them, I recall that when I removed the CPU stuff it errored out with some tensor mismatch - you can't run CPU on CUDA, etc. I'll need to re-run later tonight and post back in case, but so far removing these has either given me some error or increased the time to 28 minutes.
So rooting around this new doc i'll try to narrow things down and post back. thank you!
@RobbyGordon56 You have to move the pipeline to cuda. You can do this very easily by doing
pipe.to('cuda')
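Pulling the replies above together, here is a minimal sketch of picking exactly one placement strategy. It assumes a diffusers-style pipeline object that exposes the calls quoted in this thread. The function name `configure_pipeline` and the mode labels are made up for illustration and are not part of diffusers.

```python
def configure_pipeline(pipe, mode="full_gpu", vae_savings=False):
    """Apply one placement strategy to a diffusers-style pipeline.

    `mode` labels are hypothetical names for this sketch:
      - "full_gpu": everything resident in VRAM (fastest, needs the most VRAM)
      - "model_offload": whole sub-models are swapped onto the GPU as needed
      - "sequential_offload": layer-by-layer swapping (slowest, least VRAM)
    """
    if mode == "full_gpu":
        # Without an offload hook the pipeline must be moved to the GPU
        # explicitly, otherwise weights stay on the CPU.
        pipe.to("cuda")
    elif mode == "model_offload":
        pipe.enable_model_cpu_offload()
    elif mode == "sequential_offload":
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    if vae_savings:
        # Optional VAE memory savers; reported above as cheap in time.
        pipe.vae.enable_slicing()
        pipe.vae.enable_tiling()
    return pipe
```

On a 48GB or 80GB card the thread's advice corresponds to `configure_pipeline(pipe)` with the defaults. On smaller cards, `mode="model_offload"` with `vae_savings=True` is the combination to try first.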
Thank you both for the comments and help @ad-astra-video and @YaTharThShaRma999 -> below are the results so far.
Before using my H100 again I'm testing on a cheaper VM -> great increase in VRAM usage but still not even close to maxing it. Lower system RAM usage, great! It shaved a minute off the generation, which is also great (on cheaper chips that means you can squeeze in an extra vid per hour with the savings!) [ edit: 8min on a very cheap A6000 GPU is insane!]
So? Next step to make it, force it, to gobble the 48Gb vram? ideas?
I can try running any of your scripts if you would be so kind as to post them, if they are any different. I don't care about the prompt (mine is gibberish, I get it), I'm trying to max these GPUs!
If you are not maxing VRAM, you are hitting the limits of the GPU's processing power. When the VAE decode step hits, it will eat more VRAM.
The transformer at 50 steps eats the processing time. Even with my setup splitting this across GPUs, most of the time (97%) is spent on the transformer, which boils down to tensor core and memory speed performance, I believe. The A100/H100 have the best tensor cores and memory for the generation.
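The 97% figure lines up with the 2x 4090 timings quoted earlier in the thread (360 seconds for the transformer, under 10 seconds for everything else). A quick check, treating 10 seconds as an upper bound for the other steps, so the real share is at least this high:

```python
# Timings from the 2x 4090 run reported earlier in this thread.
transformer_s = 360  # transformer denoising steps
other_s = 10         # prompt encode + VAE decode ("less than 10 seconds")

share = transformer_s / (transformer_s + other_s)
print(f"transformer share of wall time: {share:.1%}")  # prints 97.3%
```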
@ad-astra-video thanks for the tip.
I guess from reading their front page I was under the impression it would be 26GB x2 or x3 as stated, blowing up into the 60+ish GB range, which is why they were using H100s. Not 6GB x3 + VAE + etc. to make a rough total of 26GB. This needs a big clarification on the front page, I believe, as a lot of my other custom SDXL models easily fill 42/48GB of VRAM on those chips to run 100% off the GPU. I am aware it's a different structure, but you get the rough idea of my expectations here from dealing with those kinds of models.
In the end, from this test I've gathered that you don't need an H100 at all. It's nice to make videos in 3 minutes - but at the cost of using only 1/3rd of the chip's total power and RAM, and let's not forget money, as we are all renting them. The L40s and A6000s are fantastic 7-8 minute video makers at a halfway price point. And a 24GB or a 16GB video card using the 2-3 switches above will still get you 12-14 minute generations at a cost of cents...literal cents on the dollar.
I really appreciate all the tidbits that helped get to the bottom of what this model's requirements and usage are really like. I'll edit the above title and clean it up a bit to archive for information.
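To put the price argument in numbers, here is a rough sketch using only the generation times reported in this thread (3 minutes on the H100, 7-8 on the L40/A6000, 12-14 on a 16-24GB card). No rental prices are assumed. The helpers below are illustrative and only compute throughput and the price ratio at which the faster card stops paying off.

```python
def videos_per_hour(minutes_per_video):
    """Throughput of one GPU generating back to back."""
    return 60 / minutes_per_video


def break_even_price_ratio(fast_minutes, slow_minutes):
    """How many times more per hour the fast card may cost before the
    slow card becomes cheaper per video."""
    return slow_minutes / fast_minutes


# Midpoints of the ranges quoted in the thread.
timings = {"H100": 3, "L40 / A6000": 7.5, "16-24GB card": 13}

for name, minutes in timings.items():
    print(f"{name}: {videos_per_hour(minutes):.1f} videos/hour")

for name in ("L40 / A6000", "16-24GB card"):
    ratio = break_even_price_ratio(timings["H100"], timings[name])
    print(f"H100 only wins on cost if it rents for under {ratio:.2f}x the {name} price")
```

Plug in your own provider's hourly rates and compare them against the printed ratios.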