Is this just finetuned SD models?

#13

by replicantarmy - opened 22 days ago

22 days ago

Seems a bit hacked together like an SD finetuned version...no mulit-gpu support, no quants, only 40gb+ cards supported...and from a video editing company looking to pivot...tell me I'm wrong...

ztsvvstz

21 days ago

Cant quite agree.
Got 720x512x257 working on a rtx 3090 in under a minute
1280x720x257 in under 5 minutes and the results are better than any previous 5b+ models as far as I can tell
It seems the sequential cpu offloading is broken in the pipeline tho, if you edit it and manually do that it works quite well and fast under 24gb vram.
Can agree on the Video editing tho, would be cool to have video continuation and input video editing

replicantarmy

21 days ago

Excellent news, what specific edits did you make? Have you tried CogVideo-5b 1.5? I think it's actually better and even pushing out at 8fps it's easy to upscale with something like topaz to 60fps

ztsvvstz

21 days ago

Well the edits I did are what you'd call "spaghetti code"
Basically, I removed all the .to("cuda") calls from the inference.py
and then added this to the pipeline_ltx_video.py:

self.transformer = self.transformer.to("cpu")
self.vae = self.vae.to("cpu")
self.text_encoder = self.text_encoder.to("cuda")

Basically for each step put the thing that is currently in use last and on cuda
so
First the text encoder, then the transformer and finally the vae
I can post my whole inference and ltx pipeline .py here if you want, but maybe there'll be better "official" support for this.

Its basically what pipe.enable_sequential_cpu_offload() should do but yea it somehow does not work rn (I think?)

Yes tried CogVideo 5b, tbh they both are "bad" the thing that makes them interesting is the generation speed.
This model here can generate longer smoother videos quite faster.
But the prompt following in both models is very domain specific.

So prompts like in the format "A character.... like from a TV movie" work good while more surreal experimental prompts do not work well.
Also fast motion is not really good like "A dancing crowd" causes alot of artefacts.
But Ive found this to be the case for any model with lower params up to 5b yet

replicantarmy

21 days ago

Well a hat tip to you ser, thanks for the code, not spaghetti at all! I will definitely try this again...also, have you tried hailuoai? I ran them for a solid month, pretty good at aerial nature shots.

ztsvvstz

21 days ago

Here's the code btw (pastebin was down so I had to use this site)
pipeline_ltx_video: https://justpaste.it/gxldw

inference.py: https://justpaste.it/bhwuj

Those are really just quick hax to get it working, hopefully they update the enable_sequential_cpu_offload() sometime
Havent tried any video service yet, Im more interested to get these models running locally in realtime for exhibitions :)
But it seems animatediff if still the best for that use case

replicantarmy

21 days ago

outstanding work! I'll be running overnights tests. Have you tried the DiT integration yet? As this seem helpful...also, if you haven't seen this already... https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py ...it's a good idea to scale a prompt creation tool with a system prompt to automate workflows...cheers!

Dr-Connor

12 days ago

Cant quite agree.
Got 720x512x257 working on a rtx 3090 in under a minute
1280x720x257 in under 5 minutes and the results are better than any previous 5b+ models as far as I can tell
It seems the sequential cpu offloading is broken in the pipeline tho, if you edit it and manually do that it works quite well and fast under 24gb vram.
Can agree on the Video editing tho, would be cool to have video continuation and input video editing

How many parameters is the model overall? I can't see it anywhere on the model card.

Dr-Connor

12 days ago

Excellent news, what specific edits did you make? Have you tried CogVideo-5b 1.5? I think it's actually better and even pushing out at 8fps it's easy to upscale with something like topaz to 60fps

What's topaz?

YaTharThShaRma999

11 days ago

@Dr-Connor The model is just 2b params for the dit, the t5 text encoder is somewhere between 3-4b params. And topaz is a closed source video upscaler, probably the best.

@replicantarmy Cogvideo has a lot more control right now(controlnets, Lora’s, dimensionx, consisid, fun models, Tora and more) but Ltxv is lightning fast and with the new tricks(video compression, stg). Check here: https://github.com/logtd/ComfyUI-LTXTricks

The speed itself is ridiculous, on a 4090, it takes like 15 sec for a 5sec video. With the tricks, I think it probably surpasses cogvideo on image2vid so quality and is way faster.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment