This repository contains a pruned and isolated pipeline for Stage 2 of StreamingT2V, dubbed "VidXTend."
This model's primary purpose is extending 16-frame, 256px x 256px animations by 8 frames at a time (one second at 8 fps).
@article{henschel2024streamingt2v,
title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
journal={arXiv preprint arXiv:2403.14773},
year={2024}
}
Usage
Installation
First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to also specify the version of torch you want with CUDA support; otherwise the pipeline will run on the CPU only.
pip install git+https://github.com/painebenjamin/vidxtend.git
Command-Line
A command-line utility, vidxtend, is installed with the package.
Usage: vidxtend [OPTIONS] VIDEO PROMPT
Run VidXtend on a video file, concatenating the generated frames to the end
of the video.
Options:
-fps, --frame-rate INTEGER Video FPS. Will default to the input FPS.
-s, --seconds FLOAT The total number of seconds to add to the
video. Multiply this number by frame rate to
determine total number of new frames
generated. [default: 1.0]
-np, --negative-prompt TEXT Negative prompt for the diffusion process.
-cfg, --guidance-scale FLOAT Guidance scale for the diffusion process.
[default: 7.5]
-ns, --num-inference-steps INTEGER
Number of diffusion steps. [default: 50]
-r, --seed INTEGER Random seed.
-m, --model TEXT HuggingFace model name.
-nh, --no-half Do not use half precision.
-no, --no-offload Do not offload to the CPU to preserve GPU
memory.
-ns, --no-slicing Do not use VAE slicing.
-g, --gpu-id INTEGER GPU ID to use.
-sf, --model-single-file Download and use a single file instead of a
directory.
-cf, --config-file TEXT Config file to use when using the model-
single-file option. Accepts a path or a
filename in the same directory as the single
file. Will download from the repository
passed in the model option if not provided.
[default: config.json]
-mf, --model-filename TEXT The model file to download when using the
model-single-file option. [default:
vidxtend.safetensors]
-rs, --remote-subfolder TEXT Remote subfolder to download from when using
the model-single-file option.
-cd, --cache-dir DIRECTORY Cache directory to download to. Default uses
the huggingface cache.
-o, --output FILE Output file. [default: output.mp4]
-f, --fit [actual|cover|contain|stretch]
Image fit mode. [default: cover]
-a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
Image anchor point. [default: top-left]
--help Show this message and exit.
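The --seconds help text above describes a simple multiplication. As a rough sketch of that arithmetic, the helper below computes the number of new frames and, assuming each extension pass contributes 8 frames as stated at the top of this document, how many passes that takes. The name plan_extension, the rounding to whole frames, and the ceiling on the pass count are illustrative assumptions, not taken from the VidXTend source.

```python
import math

FRAMES_PER_PASS = 8  # from the description above: 8 new frames per extension


def plan_extension(seconds: float, frame_rate: int) -> tuple[int, int]:
    """Return (new_frames, passes) for a requested extension length.

    Hypothetical helper: rounding and the ceiling are assumptions.
    """
    new_frames = round(seconds * frame_rate)
    passes = math.ceil(new_frames / FRAMES_PER_PASS)
    return new_frames, passes


print(plan_extension(1.0, 8))  # default: one second at 8 fps -> (8, 1)
print(plan_extension(2.5, 8))  # 20 new frames need 3 passes -> (20, 3)
```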
Python
You can create the pipeline, automatically pulling the weights from this repository, either as individual models:
import torch
from vidxtend import VidXTendPipeline
pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
Or, as a single file:
import torch
from vidxtend import VidXTendPipeline
pipeline = VidXTendPipeline.from_single_file(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
Use these methods to reduce GPU memory usage and improve performance:
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
pipeline.set_use_memory_efficient_attention_xformers()
Usage is as follows:
# Assume images is a list of PIL Images and prompt is a string
new_frames = pipeline(
prompt=prompt,
negative_prompt=None, # Optionally use negative prompt
image=images[-8:], # Use final 8 frames of video
input_frames_conditioning=images[:1], # Use first frame of video
eta=1.0,
guidance_scale=7.5,
output_type="pil"
).frames[8:] # Remove the first 8 frames from the output as they were used as guide for final 8
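Because each call adds 8 frames, longer extensions can be built by repeating the call above and feeding the newest frames back in. The following is a minimal sketch of that loop, assuming the pipeline returns an object whose frames attribute is a flat list with the 8 guide frames first, as in the usage above. The function name extend_frames and the passes parameter are hypothetical and not part of the vidxtend package.

```python
from typing import Any, Callable, List


def extend_frames(
    pipeline: Callable[..., Any],
    images: List[Any],
    prompt: str,
    passes: int = 1,
    **kwargs: Any,
) -> List[Any]:
    """Append 8 generated frames per pass to a copy of `images`.

    Hypothetical helper: `pipeline` is expected to accept the same keyword
    arguments as the usage example above and return an object with `.frames`.
    """
    frames = list(images)
    for _ in range(passes):
        result = pipeline(
            prompt=prompt,
            image=frames[-8:],                     # most recent 8 frames guide the pass
            input_frames_conditioning=frames[:1],  # always the video's first frame
            output_type="pil",
            **kwargs,                              # e.g. guidance_scale, eta, negative_prompt
        )
        frames.extend(result.frames[8:])           # keep only the newly generated frames
    return frames
```

For example, extend_frames(pipeline, images, prompt, passes=3, guidance_scale=7.5) would add 24 frames, or three seconds at 8 fps, under these assumptions.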