noaltian committed · Commit 799a108 · verified · 1 Parent(s): 835f45e

Update README.md

Files changed (1): README.md (+271 -5)
README.md CHANGED
---
license: other
license_name: tencent-hunyuan-community
license_link: LICENSE
---

<!-- ## **HunyuanVideo** -->

<p align="center">
  <img src="assets/logo.png" height=100>
</p>

# HunyuanVideo: A Systematic Framework For Large Video Generation Model Training

-----

This repo contains the PyTorch model definitions, pre-trained weights, and inference/sampling code for our paper exploring HunyuanVideo. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com).

> [**HunyuanVideo: A Systematic Framework For Large Video Generation Model Training**](https://arxiv.org/abs/2405.08748) <br>

## 🔥🔥🔥 News!!
* Dec 3, 2024: 🤗 We release the inference code and model weights of HunyuanVideo.

## 📑 Open-source Plan

- HunyuanVideo (Text-to-Video Model)
  - [x] Inference
  - [x] Checkpoints
  - [ ] Penguin Video Benchmark
  - [ ] Web Demo (Gradio)
  - [ ] ComfyUI
  - [ ] Diffusers
- HunyuanVideo (Image-to-Video Model)
  - [ ] Inference
  - [ ] Checkpoints

## Contents
- [HunyuanVideo: A Systematic Framework For Large Video Generation Model Training](#hunyuanvideo-a-systematic-framework-for-large-video-generation-model-training)
  - [🔥🔥🔥 News!!](#-news)
  - [📑 Open-source Plan](#-open-source-plan)
  - [Contents](#contents)
  - [**Abstract**](#abstract)
  - [**HunyuanVideo Overall Architecture**](#hunyuanvideo-overall-architecture)
  - [🎉 **HunyuanVideo Key Features**](#-hunyuanvideo-key-features)
    - [**Unified Image and Video Generative Architecture**](#unified-image-and-video-generative-architecture)
    - [**MLLM Text Encoder**](#mllm-text-encoder)
    - [**3D VAE**](#3d-vae)
    - [**Prompt Rewrite**](#prompt-rewrite)
  - [📈 Comparisons](#-comparisons)
  - [📜 Requirements](#-requirements)
  - [🛠️ Dependencies and Installation](#-dependencies-and-installation)
    - [Installation Guide for Linux](#installation-guide-for-linux)
  - [🧱 Download Pretrained Models](#-download-pretrained-models)
  - [🔑 Inference](#-inference)
    - [Using Command Line](#using-command-line)
    - [More Configurations](#more-configurations)
  - [🔗 BibTeX](#-bibtex)
  - [Acknowledgements](#acknowledgements)
---

## **Abstract**
We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling the model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models.

We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion diversity, text-video alignment, and generation stability. According to professional human evaluation results, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code and weights of the foundation model and its applications, we aim to bridge the gap between closed-source and open-source video foundation models. This initiative will empower everyone in the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem.

## **HunyuanVideo Overall Architecture**

HunyuanVideo is trained in a spatio-temporally compressed latent space obtained with a Causal 3D VAE. Text prompts are encoded by a large language model and used as the condition. Taking Gaussian noise and the condition as input, our generative model produces an output latent, which the 3D VAE decoder then decodes into images or videos. A schematic of this sampling loop is sketched after the figure below.
<p align="center">
  <img src="assets/overall.png" height=300>
</p>
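The following is a minimal, schematic sketch of that loop, not the released implementation: `text_encoder`, `dit`, and `vae_decoder` are hypothetical stand-ins for the real components, and the latent shape and schedule are illustrative only.

```python
import torch

def generate(prompt, text_encoder, dit, vae_decoder, steps=30):
    """Schematic sampling loop; the three callables stand in for the real components."""
    cond = text_encoder(prompt)                 # text prompt encoded by the LLM, used as the condition
    latent = torch.randn(1, 16, 33, 90, 160)    # Gaussian noise in the compressed latent space (illustrative shape)
    for t in torch.linspace(1.0, 0.0, steps):   # iterative denoising (schematic, not the exact scheduler)
        latent = dit(latent, t, cond)           # the generative model refines the latent given the condition
    return vae_decoder(latent)                  # the 3D VAE decoder maps the latent back to pixel-space frames
```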

## 🎉 **HunyuanVideo Key Features**
### **Unified Image and Video Generative Architecture**
HunyuanVideo introduces the Transformer design and employs a Full Attention mechanism for unified image and video generation. Specifically, we use a "Dual-stream to Single-stream" hybrid model design for video generation. In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference. In the single-stream phase, we concatenate the video and text tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance; a rough sketch of the layout follows the figure below.
<p align="center">
  <img src="assets/backbone.png" height=350>
</p>
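As a rough illustration of the dual-to-single-stream idea described above, here is a minimal PyTorch sketch; the module names, widths, and depths are hypothetical and are not taken from the released code.

```python
import torch
import torch.nn as nn

class StreamBlock(nn.Module):
    """A plain pre-norm Transformer block (self-attention + MLP)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class DualToSingleStream(nn.Module):
    """Dual-stream phase: video and text tokens pass through separate blocks.
    Single-stream phase: the two sequences are concatenated and share blocks (full attention)."""
    def __init__(self, dim: int = 128, heads: int = 8, n_dual: int = 2, n_single: int = 2):
        super().__init__()
        self.video_blocks = nn.ModuleList(StreamBlock(dim, heads) for _ in range(n_dual))
        self.text_blocks = nn.ModuleList(StreamBlock(dim, heads) for _ in range(n_dual))
        self.single_blocks = nn.ModuleList(StreamBlock(dim, heads) for _ in range(n_single))

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        for v_blk, t_blk in zip(self.video_blocks, self.text_blocks):
            video_tokens, text_tokens = v_blk(video_tokens), t_blk(text_tokens)
        x = torch.cat([video_tokens, text_tokens], dim=1)   # fuse both modalities in one sequence
        for blk in self.single_blocks:
            x = blk(x)
        return x[:, : video_tokens.shape[1]]                # keep the refined video tokens

# Toy shapes: 16 video tokens and 8 text tokens, width 128.
out = DualToSingleStream()(torch.randn(1, 16, 128), torch.randn(1, 8, 128))
print(out.shape)  # torch.Size([1, 16, 128])
```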

### **MLLM Text Encoder**
Some previous text-to-video models typically use a pretrained CLIP and T5-XXL as text encoders, where CLIP uses a Transformer encoder and T5 uses an encoder-decoder structure. In contrast, we utilize a pretrained Multimodal Large Language Model (MLLM) with a decoder-only structure as our text encoder, which has the following advantages: (i) compared with T5, an MLLM after visual instruction finetuning has better image-text alignment in the feature space, which alleviates the difficulty of instruction following in diffusion models; (ii) compared with CLIP, an MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) an MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, the MLLM is based on causal attention, whereas T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features; a rough sketch follows the figure below.
<p align="center">
  <img src="assets/text_encoder.png" height=275>
</p>
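A minimal sketch of such a bidirectional token refiner, assuming it operates on the MLLM's last hidden states; the dimensions, head count, and depth here are placeholders rather than the released configuration.

```python
import torch
import torch.nn as nn

class BidirectionalTokenRefiner(nn.Module):
    """Refines causal-attention MLLM hidden states with a small bidirectional encoder."""
    def __init__(self, dim: int, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, mllm_hidden_states: torch.Tensor) -> torch.Tensor:
        # No causal mask here, so every text token can attend to every other token.
        return self.encoder(mllm_hidden_states)

# Toy example: a batch of 77 text-token features of width 64.
refined = BidirectionalTokenRefiner(dim=64, heads=4)(torch.randn(1, 77, 64))
```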

### **3D VAE**
HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space, and channel to 4, 8, and 16 respectively. This significantly reduces the number of tokens for the subsequent diffusion transformer model, allowing us to train on videos at their original resolution and frame rate. The resulting latent shapes are worked out after the figure below.
<p align="center">
  <img src="assets/3dvae.png" height=150>
</p>
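With those ratios, the latent shape for a 129-frame 720x1280 clip works out as in the following back-of-the-envelope sketch; the causal `(T - 1) / 4 + 1` frame count is an assumption based on the CausalConv3D design, not a documented formula.

```python
# Token count for the stated compression ratios: 4x in time, 8x in each spatial
# dimension, and 16 latent channels.
def latent_shape(frames: int, height: int, width: int):
    t = (frames - 1) // 4 + 1          # assumed causal temporal compression
    h, w, c = height // 8, width // 8, 16
    return c, t, h, w

c, t, h, w = latent_shape(129, 720, 1280)
print((c, t, h, w), "->", t * h * w, "latent positions")  # (16, 33, 90, 160) -> 475200
```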

### **Prompt Rewrite**
To address the variability in linguistic style and length of user-provided prompts, we fine-tune the [Hunyuan-Large model](https://github.com/Tencent/Tencent-Hunyuan-Large) as our prompt rewrite model to adapt the original user prompt to a model-preferred prompt.

We provide two rewrite modes, Normal mode and Master mode, which can be called using different prompts. The prompts are shown [here](hyvideo/prompt_rewrite.py). The Normal mode is designed to enhance the video generation model's comprehension of user intent, facilitating a more accurate interpretation of the instructions provided. The Master mode enhances the description of aspects such as composition, lighting, and camera movement, and leans towards generating videos with higher visual quality. However, this emphasis may occasionally result in the loss of some semantic details.

The prompt rewrite model can be directly deployed and inferred using the [Hunyuan-Large original code](https://github.com/Tencent/Tencent-Hunyuan-Large). We release the weights of the prompt rewrite model [here](https://huggingface.co/Tencent/HunyuanVideo-PromptRewrite).

## 📈 Comparisons

To evaluate the performance of HunyuanVideo, we selected five strong baselines from closed-source video generation models. In total, we used 1,533 text prompts, generating an equal number of video samples with HunyuanVideo in a single run. For a fair comparison, we conducted inference only once, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models, ensuring consistent video resolution. Videos were assessed on three criteria: Text Alignment, Motion Quality, and Visual Quality. More than 60 professional evaluators performed the evaluation. Notably, HunyuanVideo demonstrated the best overall performance, particularly excelling in motion quality.

<p align="center">
<table>
<thead>
<tr>
    <th>Model</th> <th>Open Source</th> <th>Duration</th> <th>Text Alignment</th> <th>Motion Quality</th> <th>Visual Quality</th> <th>Overall</th> <th>Ranking</th>
</tr>
</thead>
<tbody>
<tr>
    <td>HunyuanVideo (Ours)</td> <td> ✔ </td> <td>5s</td> <td>68.5%</td> <td>64.5%</td> <td>96.4%</td> <td>44.7%</td> <td>1</td>
</tr>
<tr>
    <td>CNTopA (API)</td> <td> &#10008; </td> <td>5s</td> <td>68.8%</td> <td>57.5%</td> <td>95.8%</td> <td>38.8%</td> <td>2</td>
</tr>
<tr>
    <td>CNTopB (Web)</td> <td> &#10008; </td> <td>5s</td> <td>64.5%</td> <td>59.3%</td> <td>97.7%</td> <td>37.6%</td> <td>3</td>
</tr>
<tr>
    <td>GEN-3 alpha (Web)</td> <td> &#10008; </td> <td>6s</td> <td>49.3%</td> <td>48.3%</td> <td>97.1%</td> <td>24.6%</td> <td>4</td>
</tr>
<tr>
    <td>CNTopC (Web)</td> <td> &#10008; </td> <td>5s</td> <td>52.7%</td> <td>42.1%</td> <td>96.2%</td> <td>24.1%</td> <td>5</td>
</tr>
<tr>
    <td>Luma1.6 (API)</td> <td> &#10008; </td> <td>5s</td> <td>59.7%</td> <td>36.8%</td> <td>93.5%</td> <td>21.6%</td> <td>6</td>
</tr>
</tbody>
</table>
</p>

## 📜 Requirements

The following table shows the requirements for running the HunyuanVideo model (batch size = 1) to generate videos:

|     Model    | GPU  | Setting<br/>(height/width/frame) | Denoising step | GPU Peak Memory |
|:------------:|:----:|:--------------------------------:|:--------------:|:---------------:|
| HunyuanVideo | H800 |         720px1280px129f          |       30       |       60GB      |
| HunyuanVideo | H800 |          544px960px129f          |       30       |       45GB      |
| HunyuanVideo | H20  |         720px1280px129f          |       30       |       60GB      |
| HunyuanVideo | H20  |          544px960px129f          |       30       |       45GB      |

* An NVIDIA GPU with CUDA support is required.
* We have tested on a single H800/H20 GPU.
* **Minimum**: The minimum GPU memory required is 60GB for 720px1280px129f and 45GB for 544px960px129f (a quick pre-flight check is sketched after this list).
* **Recommended**: We recommend using a GPU with 80GB of memory for better generation quality.
* Tested operating system: Linux
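A rough pre-flight check against the table above can be done with PyTorch; this is a sketch that assumes a single visible CUDA device.

```python
import torch

# Compare the installed GPU's memory against the 45GB / 60GB figures above.
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.0f} GB")
if total_gb < 45:
    print("Below the 45GB minimum; consider a smaller setting and --use-cpu-offload.")
elif total_gb < 60:
    print("Enough for 544px960px129f, but below the 60GB needed for 720px1280px129f.")
```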

## 🛠️ Dependencies and Installation

Begin by cloning the repository:
```shell
git clone https://github.com/tencent/HunyuanVideo
cd HunyuanVideo
```

### Installation Guide for Linux

We provide an `environment.yml` file for setting up a Conda environment.
Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).

We recommend CUDA versions 11.8 and 12.0+.

```shell
# 1. Prepare the conda environment
conda env create -f environment.yml

# 2. Activate the environment
conda activate HunyuanVideo

# 3. Install pip dependencies
python -m pip install -r requirements.txt

# 4. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.9.post1
```
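After these steps, a short sanity check (a sketch, assuming the Conda environment is active) confirms that PyTorch sees the GPU and that flash-attention imports:

```python
# Quick sanity check that the environment installed above is usable.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```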

Additionally, HunyuanVideo also provides a pre-built Docker image to simplify the setup process:
[docker_hunyuanvideo](https://hub.docker.com/repository/docker/hunyuanvideo/hunyuanvideo/general).

```shell
# 1. Download the docker image tar file (for CUDA 12).
wget https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/hunyuan_video_cu12.tar

# 2. Import the docker tar file and show the image metadata (for CUDA 12).
docker load -i hunyuan_video_cu12.tar

docker image ls

# 3. Run the container based on the image (replace docker_image_tag with the tag shown by `docker image ls`)
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged docker_image_tag
```


## 🧱 Download Pretrained Models

The details of downloading the pretrained models are described [here](ckpts/README.md).
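For reference, one way to fetch the weights is with `huggingface_hub`; the repo id and target directory below are assumptions for illustration, so follow `ckpts/README.md` for the authoritative layout.

```python
# The repo id and local directory are assumptions; ckpts/README.md documents
# where each checkpoint should actually live.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tencent/HunyuanVideo", local_dir="ckpts")
```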

## 🔑 Inference
We list the height/width/frame settings we support in the following table.

|      Resolution    |     h/w=9:16    |     h/w=16:9    |     h/w=4:3     |     h/w=3:4     |     h/w=1:1     |
|:------------------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|
|        540p        | 544px960px129f  | 960px544px129f  | 624px832px129f  | 832px624px129f  | 720px720px129f  |
| 720p (recommended) | 720px1280px129f | 1280px720px129f | 1104px832px129f | 832px1104px129f | 960px960px129f  |

### Using Command Line

```bash
cd HunyuanVideo

python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 30 \
    --prompt "a cat is running, realistic." \
    --flow-reverse \
    --seed 0 \
    --use-cpu-offload \
    --save-path ./results
```
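To run several prompts in a row, the same command can be driven from Python; this is a convenience sketch around the CLI above, and the prompts are placeholders.

```python
import subprocess

# Simple batch driver around sample_video.py, using the same flags as above.
prompts = [
    "a cat is running, realistic.",
    "a dog surfing on a wave at sunset.",
]
for i, prompt in enumerate(prompts):
    subprocess.run(
        [
            "python3", "sample_video.py",
            "--video-size", "720", "1280",
            "--video-length", "129",
            "--infer-steps", "30",
            "--prompt", prompt,
            "--flow-reverse",
            "--seed", str(i),
            "--use-cpu-offload",
            "--save-path", "./results",
        ],
        check=True,
    )
```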

### More Configurations

We list some more useful configurations for easy usage (an example that combines the optional flags is sketched after the table):

| Argument               | Default   | Description                                                                                      |
|:----------------------:|:---------:|:------------------------------------------------------------------------------------------------:|
| `--prompt`             | None      | The text prompt for video generation                                                             |
| `--video-size`         | 720 1280  | The size (height width) of the generated video                                                   |
| `--video-length`       | 129       | The number of frames in the generated video                                                      |
| `--infer-steps`        | 30        | The number of sampling steps                                                                     |
| `--embedded-cfg-scale` | 6.0       | Embedded classifier-free guidance scale                                                          |
| `--flow-shift`         | 9.0       | Shift factor for flow-matching schedulers                                                        |
| `--flow-reverse`       | False     | If set, learn/sample from t=1 to t=0                                                             |
| `--neg-prompt`         | None      | The negative prompt for video generation                                                         |
| `--seed`               | 0         | The random seed for generating the video                                                         |
| `--use-cpu-offload`    | False     | Use CPU offloading when loading the model to save memory; necessary for high-resolution video generation |
| `--save-path`          | ./results | Path to save the generated video                                                                 |
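The optional arguments from the table can be combined in a single invocation; the sketch below assembles such a command programmatically, with illustrative values only.

```python
import subprocess

# Every flag comes from the table above; the prompt and negative prompt are illustrative.
options = {
    "--prompt": ["a cat is running, realistic."],
    "--neg-prompt": ["blurry, low quality"],
    "--video-size": ["720", "1280"],
    "--video-length": ["129"],
    "--infer-steps": ["30"],
    "--embedded-cfg-scale": ["6.0"],
    "--flow-shift": ["9.0"],
    "--seed": ["0"],
    "--save-path": ["./results"],
}
cmd = ["python3", "sample_video.py", "--flow-reverse", "--use-cpu-offload"]
for flag, values in options.items():
    cmd += [flag, *values]
subprocess.run(cmd, check=True)
```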


## 🔗 BibTeX
If you find [HunyuanVideo](https://arxiv.org/abs/2405.08748) useful for your research and applications, please cite it using this BibTeX:

```BibTeX
@misc{XX,
      title={HunyuanVideo: A Systematic Framework For Large Video Generation Model Training},
      author={Hunyuan Foundation Model Team},
      year={2025},
      eprint={XXX},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgements
We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories for their open research and exploration. We also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.