Update README.md
---
license: other
license_name: tencent-hunyuan-community
license_link: LICENSE
---

<!-- ## **HunyuanVideo** -->

<p align="center">
<img src="assets/logo.png" height=100>
</p>

# HunyuanVideo: A Systematic Framework For Large Video Generation Model Training

-----

This repo contains PyTorch model definitions, pre-trained weights, and inference/sampling code for our paper exploring HunyuanVideo. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com).

> [**HunyuanVideo: A Systematic Framework For Large Video Generation Model Training**](https://arxiv.org/abs/2405.08748) <br>

## 🔥🔥🔥 News!!
* Dec 3, 2024: 🤗 We release the inference code and model weights of HunyuanVideo.

## 📑 Open-source Plan

- HunyuanVideo (Text-to-Video Model)
  - [x] Inference
  - [x] Checkpoints
  - [ ] Penguin Video Benchmark
  - [ ] Web Demo (Gradio)
  - [ ] ComfyUI
  - [ ] Diffusers
- HunyuanVideo (Image-to-Video Model)
  - [ ] Inference
  - [ ] Checkpoints

## Contents
- [HunyuanVideo: A Systematic Framework For Large Video Generation Model Training](#hunyuanvideo-a-systematic-framework-for-large-video-generation-model-training)
  - [🔥🔥🔥 News!!](#-news)
  - [📑 Open-source Plan](#-open-source-plan)
  - [Contents](#contents)
  - [**Abstract**](#abstract)
  - [**HunyuanVideo Overall Architecture**](#hunyuanvideo-overall-architecture)
  - [🎉 **HunyuanVideo Key Features**](#-hunyuanvideo-key-features)
    - [**Unified Image and Video Generative Architecture**](#unified-image-and-video-generative-architecture)
    - [**MLLM Text Encoder**](#mllm-text-encoder)
    - [**3D VAE**](#3d-vae)
    - [**Prompt Rewrite**](#prompt-rewrite)
  - [📈 Comparisons](#-comparisons)
  - [📜 Requirements](#-requirements)
  - [🛠️ Dependencies and Installation](#-dependencies-and-installation)
    - [Installation Guide for Linux](#installation-guide-for-linux)
  - [🧱 Download Pretrained Models](#-download-pretrained-models)
  - [🔑 Inference](#-inference)
    - [Using Command Line](#using-command-line)
    - [More Configurations](#more-configurations)
  - [🔗 BibTeX](#-bibtex)
  - [Acknowledgements](#acknowledgements)
---

## **Abstract**
We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models.

We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion diversity, text-video alignment, and generation stability. According to professional human evaluation results, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code and weights of the foundation model and its applications, we aim to bridge the gap between closed-source and open-source video foundation models. This initiative will empower everyone in the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem.

## **HunyuanVideo Overall Architecture**

HunyuanVideo is trained on a spatially-temporally compressed latent space, which is compressed through a Causal 3D VAE. Text prompts are encoded using a large language model and used as the condition. Taking Gaussian noise and the condition as input, our generative model produces an output latent, which is decoded into images or videos through the 3D VAE decoder.
<p align="center">
<img src="assets/overall.png" height=300>
</p>
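
To make this data flow concrete, here is a minimal illustrative sketch. The function names and tensor shapes are placeholders chosen for this example (they are not the repository's actual API); only the control flow mirrors the description above.

```python
import torch

# Placeholder components standing in for the real modules described above.
def text_encoder(prompt: str) -> torch.Tensor:
    # Dummy text condition; the real encoder is an MLLM (see "MLLM Text Encoder" below).
    return torch.zeros(1, 77, 4096)

def generative_model(noise: torch.Tensor, condition: torch.Tensor, steps: int = 30) -> torch.Tensor:
    latent = noise
    for _ in range(steps):
        # A real model would predict an update conditioned on the text features here;
        # this stub only shows the iterative denoising control flow.
        latent = latent
    return latent

def vae_decode(latent: torch.Tensor) -> torch.Tensor:
    # Dummy decode: map the latent back toward pixel space (4x time, 8x space).
    c, t, h, w = latent.shape[1:]
    return torch.zeros(latent.shape[0], 3, (t - 1) * 4 + 1, h * 8, w * 8)

condition = text_encoder("a cat is running, realistic.")
noise = torch.randn(1, 16, 9, 16, 16)   # toy-sized Gaussian noise in the compressed latent space
video = vae_decode(generative_model(noise, condition))
print(video.shape)                      # torch.Size([1, 3, 33, 128, 128])
```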

## 🎉 **HunyuanVideo Key Features**
### **Unified Image and Video Generative Architecture**
HunyuanVideo introduces the Transformer design and employs a Full Attention mechanism for unified image and video generation.
Specifically, we use a "Dual-stream to Single-stream" hybrid model design for video generation. In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference. In the single-stream phase, we concatenate the video and text tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion.
This design captures complex interactions between visual and semantic information, enhancing overall model performance.
<p align="center">
<img src="assets/backbone.png" height=350>
</p>
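
As an illustration of this layout rather than the repository's actual implementation, the toy sketch below uses separate video and text Transformer stacks for the dual-stream phase, then concatenates the token sequences and applies joint full-attention blocks for the single-stream phase; the dimensions and depths are arbitrary.

```python
import torch
import torch.nn as nn

class DualToSingleStream(nn.Module):
    """Toy version of the dual-stream -> single-stream layout."""
    def __init__(self, dim=128, heads=4, dual_depth=2, single_depth=2):
        super().__init__()
        make_block = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blocks = nn.ModuleList(make_block() for _ in range(dual_depth))
        self.text_blocks = nn.ModuleList(make_block() for _ in range(dual_depth))
        self.joint_blocks = nn.ModuleList(make_block() for _ in range(single_depth))

    def forward(self, video_tokens, text_tokens):
        # Dual-stream phase: each modality is processed independently.
        for v_block, t_block in zip(self.video_blocks, self.text_blocks):
            video_tokens = v_block(video_tokens)
            text_tokens = t_block(text_tokens)
        # Single-stream phase: concatenate and fuse with full attention.
        tokens = torch.cat([video_tokens, text_tokens], dim=1)
        for j_block in self.joint_blocks:
            tokens = j_block(tokens)
        return tokens

model = DualToSingleStream()
out = model(torch.randn(1, 16, 128), torch.randn(1, 8, 128))
print(out.shape)  # torch.Size([1, 24, 128])
```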

### **MLLM Text Encoder**
Some previous text-to-video models typically use pretrained CLIP and T5-XXL as text encoders, where CLIP uses a Transformer encoder and T5 uses an encoder-decoder structure. In contrast, we utilize a pretrained Multimodal Large Language Model (MLLM) with a decoder-only structure as our text encoder, which has the following advantages: (i) compared with T5, the MLLM after visual instruction finetuning has better image-text alignment in the feature space, which alleviates the difficulty of instruction following in diffusion models; (ii) compared with CLIP, the MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) the MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, the MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features.
<p align="center">
<img src="assets/text_encoder.png" height=275>
</p>
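
For intuition, the sketch below shows the generic pattern of using a decoder-only LLM as a text encoder: prepend a system instruction to the user prompt and take the final hidden states as text features. The model path and instruction text are placeholders, and the repository's actual text-encoding code (including the bidirectional token refiner) may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/mllm"   # placeholder, not a real checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

system_instruction = "Describe the video by detailing the subjects, actions and scene."  # placeholder
prompt = "a cat is running, realistic."

inputs = tokenizer(system_instruction + "\n" + prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last-layer hidden states serve as the text condition; a bidirectional token
# refiner would further process these features before they reach the generator.
text_features = outputs.hidden_states[-1]   # [batch, seq_len, hidden_dim]
print(text_features.shape)
```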

### **3D VAE**
HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space, and channel to 4, 8, and 16, respectively. This significantly reduces the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at their original resolution and frame rate.
<p align="center">
<img src="assets/3dvae.png" height=150>
</p>
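
As a quick back-of-the-envelope check, the stated ratios imply the following latent shape for the 720p, 129-frame setting used later in this README; the handling of the first frame under causal temporal compression is an assumed convention here, not a detail taken from the code.

```python
# Latent shape implied by the stated compression ratios for a 129-frame 720x1280 video.
frames, height, width = 129, 720, 1280
time_ratio, space_ratio, latent_channels = 4, 8, 16

latent_frames = (frames - 1) // time_ratio + 1   # assumed causal convention: first frame kept, rest compressed 4x
latent_height = height // space_ratio            # 720 / 8  = 90
latent_width = width // space_ratio              # 1280 / 8 = 160

print((latent_channels, latent_frames, latent_height, latent_width))  # (16, 33, 90, 160)
```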

### **Prompt Rewrite**
To address the variability in linguistic style and length of user-provided prompts, we fine-tune the [Hunyuan-Large model](https://github.com/Tencent/Tencent-Hunyuan-Large) as our prompt rewrite model to adapt the original user prompt to a model-preferred prompt.

We provide two rewrite modes: Normal mode and Master mode, which can be invoked using different prompts, shown [here](hyvideo/prompt_rewrite.py). The Normal mode is designed to enhance the video generation model's comprehension of user intent, facilitating a more accurate interpretation of the instructions provided. The Master mode enhances the description of aspects such as composition, lighting, and camera movement, which leans towards generating videos with higher visual quality. However, this emphasis may occasionally result in the loss of some semantic details.

The prompt rewrite model can be directly deployed and inferred using the [Hunyuan-Large original code](https://github.com/Tencent/Tencent-Hunyuan-Large). We release the weights of the prompt rewrite model [here](https://huggingface.co/Tencent/HunyuanVideo-PromptRewrite).
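
The snippet below is only a schematic of what a rewrite call could look like with a generic `transformers`-style API; the model path and template string are hypothetical stand-ins. The actual Normal/Master prompts live in [hyvideo/prompt_rewrite.py](hyvideo/prompt_rewrite.py), and deployment should follow the Hunyuan-Large code linked above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/HunyuanVideo-PromptRewrite"   # placeholder path to the released weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Hypothetical stand-in for the Normal/Master mode templates in hyvideo/prompt_rewrite.py.
rewrite_template = "Rewrite the following video prompt with richer detail: {prompt}"
user_prompt = "a cat is running, realistic."

inputs = tokenizer(rewrite_template.format(prompt=user_prompt), return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```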

## 📈 Comparisons

To evaluate the performance of HunyuanVideo, we selected five strong baselines from closed-source video generation models. In total, we utilized 1,533 text prompts, generating an equal number of video samples with HunyuanVideo in a single run. For a fair comparison, we conducted inference only once, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models, ensuring consistent video resolution. Videos were assessed based on three criteria: Text Alignment, Motion Quality, and Visual Quality. More than 60 professional evaluators performed the evaluation. Notably, HunyuanVideo demonstrated the best overall performance, particularly excelling in motion quality.

<p align="center">
<table>
<thead>
<tr>
    <th rowspan="2">Model</th> <th rowspan="2">Open Source</th> <th>Duration</th> <th>Text Alignment</th> <th>Motion Quality</th> <th rowspan="2">Visual Quality</th> <th rowspan="2">Overall</th> <th rowspan="2">Ranking</th>
</tr>
</thead>
<tbody>
<tr>
    <td>HunyuanVideo (Ours)</td> <td> ✔ </td> <td>5s</td> <td>68.5%</td> <td>64.5%</td> <td>96.4%</td> <td>44.7%</td> <td>1</td>
</tr>
<tr>
    <td>CNTopA (API)</td> <td> ✘ </td> <td>5s</td> <td>68.8%</td> <td>57.5%</td> <td>95.8%</td> <td>38.8%</td> <td>2</td>
</tr>
<tr>
    <td>CNTopB (Web)</td> <td> ✘ </td> <td>5s</td> <td>64.5%</td> <td>59.3%</td> <td>97.7%</td> <td>37.6%</td> <td>3</td>
</tr>
<tr>
    <td>GEN-3 alpha (Web)</td> <td> ✘ </td> <td>6s</td> <td>49.3%</td> <td>48.3%</td> <td>97.1%</td> <td>24.6%</td> <td>4</td>
</tr>
<tr>
    <td>CNTopC (Web)</td> <td> ✘ </td> <td>5s</td> <td>52.7%</td> <td>42.1%</td> <td>96.2%</td> <td>24.1%</td> <td>5</td>
</tr>
<tr>
    <td>Luma1.6 (API)</td> <td> ✘ </td> <td>5s</td> <td>59.7%</td> <td>36.8%</td> <td>93.5%</td> <td>21.6%</td> <td>6</td>
</tr>
</tbody>
</table>
</p>

## 📜 Requirements

The following table shows the requirements for running the HunyuanVideo model (batch size = 1) to generate videos:

|    Model     | GPU  | Setting<br/>(height/width/frame) | Denoising step | GPU Peak Memory |
|:------------:|:----:|:--------------------------------:|:--------------:|:---------------:|
| HunyuanVideo | H800 |         720px1280px129f          |       30       |      60GB       |
| HunyuanVideo | H800 |          544px960px129f          |       30       |      45GB       |
| HunyuanVideo | H20  |         720px1280px129f          |       30       |      60GB       |
| HunyuanVideo | H20  |          544px960px129f          |       30       |      45GB       |

* An NVIDIA GPU with CUDA support is required.
* We have tested on a single H800/H20 GPU.
* **Minimum**: The minimum GPU memory required is 60GB for 720px1280px129f and 45GB for 544px960px129f.
* **Recommended**: We recommend using a GPU with 80GB of memory for better generation quality.
* Tested operating system: Linux

## 🛠️ Dependencies and Installation

Begin by cloning the repository:
```shell
git clone https://github.com/tencent/HunyuanVideo
cd HunyuanVideo
```

### Installation Guide for Linux

We provide an `environment.yml` file for setting up a Conda environment.
Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).

We recommend CUDA versions 11.8 and 12.0+.

```shell
# 1. Prepare conda environment
conda env create -f environment.yml

# 2. Activate the environment
conda activate HunyuanVideo

# 3. Install pip dependencies
python -m pip install -r requirements.txt

# 4. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.9.post1
```

Additionally, to simplify the setup process, HunyuanVideo also provides a pre-built Docker image:
[docker_hunyuanvideo](https://hub.docker.com/repository/docker/hunyuanvideo/hunyuanvideo/general).

```shell
# 1. Use the following link to download the docker image tar file (for CUDA 12).
wget https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/hunyuan_video_cu12.tar

# 2. Import the docker tar file and show the image meta information (for CUDA 12).
docker load -i hunyuan_video_cu12.tar

docker image ls

# 3. Run the container based on the image
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged docker_image_tag
```

## 🧱 Download Pretrained Models

The details of downloading pretrained models are shown [here](ckpts/README.md).

## 🔑 Inference
We list the height/width/frame settings we support in the following table.

|      Resolution      |     h/w=9:16     |     h/w=16:9     |     h/w=4:3      |     h/w=3:4      |    h/w=1:1     |
|:--------------------:|:----------------:|:----------------:|:----------------:|:----------------:|:--------------:|
|         540p         |  544px960px129f  |  960px544px129f  |  624px832px129f  |  832px624px129f  | 720px720px129f |
|  720p (recommended)  | 720px1280px129f  | 1280px720px129f  | 1104px832px129f  | 832px1104px129f  | 960px960px129f |

### Using Command Line

```bash
cd HunyuanVideo

python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 30 \
    --prompt "a cat is running, realistic." \
    --flow-reverse \
    --seed 0 \
    --use-cpu-offload \
    --save-path ./results
```

### More Configurations

We list some more useful configurations for easy usage:

|        Argument        |  Default  |                Description                 |
|:----------------------:|:---------:|:------------------------------------------:|
|       `--prompt`       |   None    |    The text prompt for video generation    |
|     `--video-size`     | 720 1280  |      The size of the generated video       |
|    `--video-length`    |    129    |     The length of the generated video      |
|     `--infer-steps`    |    30     |      The number of steps for sampling      |
| `--embedded-cfg-scale` |    6.0    |  Embedded classifier-free guidance scale   |
|     `--flow-shift`     |    9.0    | Shift factor for flow matching schedulers  |
|    `--flow-reverse`    |   False   | If set, learning/sampling proceeds from t=1 to t=0 |
|     `--neg-prompt`     |   None    |  The negative prompt for video generation  |
|        `--seed`        |     0     |  The random seed for generating the video  |
|   `--use-cpu-offload`  |   False   | Use CPU offloading for model loading to save more GPU memory; necessary for high-resolution video generation |
|      `--save-path`     | ./results |      Path to save the generated video      |
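
For readers curious how `--flow-shift` and `--flow-reverse` typically interact, the sketch below builds a shifted timestep schedule; the shift formula is the common SD3-style one and is an assumption here, not necessarily the exact scheduler used by this repository.

```python
import numpy as np

def shifted_timesteps(num_steps: int = 30, shift: float = 9.0, reverse: bool = True):
    # --flow-reverse: sample from t=1 (pure noise) down to t=0 (clean latent).
    t = np.linspace(1.0, 0.0, num_steps + 1) if reverse else np.linspace(0.0, 1.0, num_steps + 1)
    # Assumed SD3-style shift: larger shift keeps more steps at high noise levels,
    # which is commonly helpful for high-resolution generation.
    return shift * t / (1.0 + (shift - 1.0) * t)

print(shifted_timesteps()[:5])  # first few (shifted) timesteps of the schedule
```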

## 🔗 BibTeX
If you find [HunyuanVideo](https://arxiv.org/abs/2405.08748) useful for your research and applications, please cite using this BibTeX:

```BibTeX
@misc{XX,
      title={HunyuanVideo: A Systematic Framework For Large Video Generation Model Training},
      author={Hunyuan Foundation Model Team},
      year={2025},
      eprint={XXX},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgements
We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories for their open research and exploration.
We also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.