Update README.md
README.md CHANGED
@@ -7,7 +7,7 @@ license_link: LICENSE
 <!-- ## **HunyuanVideo** -->
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/logo.png" height=100>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/logo.png" height=100>
 </p>
 
 # HunyuanVideo: A Systematic Framework For Large Video Generation Model Training
@@ -48,7 +48,7 @@ using a large language model, and used as the condition. Gaussian noise and cond
 input, our generation model generates an output latent, which is decoded to images or videos through
 the 3D VAE decoder.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/overall.png" height=300>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/overall.png" height=300>
 </p>
 
 ## 🎉 **HunyuanVideo Key Features**
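The context lines in this hunk restate the generation pipeline: the prompt is encoded into a condition, Gaussian noise is drawn as the starting latent, the generation model denoises it, and the 3D VAE decoder maps the result back to pixels. A minimal sketch of that flow, assuming hypothetical `text_encoder`, `dit`, and `vae` callables rather than the repository's actual classes:

```python
import torch

# Hypothetical stand-ins for HunyuanVideo's modules; names and signatures
# here are illustrative assumptions, not the repository's API.
def generate_video(prompt, text_encoder, dit, vae, latent_shape, steps=50):
    cond = text_encoder(prompt)            # prompt -> condition embeddings
    latent = torch.randn(latent_shape)     # Gaussian noise as the starting latent
    for t in reversed(range(steps)):       # iterative denoising by the generation model
        latent = dit(latent, cond, t)
    return vae.decode(latent)              # 3D VAE decoder -> pixel-space video
```

Any real invocation would go through the repository's own sampling entry points; this only mirrors the data flow the paragraph describes.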
@@ -60,7 +60,7 @@ tokens and feed them into subsequent Transformer blocks for effective multimodal
 This design captures complex interactions between visual and semantic information, enhancing
 overall model performance.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/backbone.png" height=350>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/backbone.png" height=350>
 </p>
 
 ### **MLLM Text Encoder**
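The fusion step the hunk above describes, feeding visual and text tokens into shared subsequent Transformer blocks, amounts to concatenating the two token sequences before joint attention. A sketch of that one step; the shapes, hidden size, and use of a stock PyTorch layer are all assumptions for illustration, not the model's real block:

```python
import torch

# Illustrative fusion: concatenate visual and text tokens along the
# sequence dimension so one shared block attends over both modalities.
visual_tokens = torch.randn(2, 1024, 3072)  # (batch, video tokens, hidden) -- assumed shapes
text_tokens = torch.randn(2, 77, 3072)      # (batch, text tokens, hidden)

fused = torch.cat([visual_tokens, text_tokens], dim=1)  # (2, 1101, 3072)

block = torch.nn.TransformerEncoderLayer(
    d_model=3072, nhead=24, batch_first=True)           # stand-in for a DiT block
out = block(fused)                                      # joint visual-semantic attention
```

Joint attention over the concatenated sequence is what lets text tokens modulate visual tokens (and vice versa) without a separate cross-attention path.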
@@ -68,13 +68,13 @@ Some previous text-to-video models typically use pretrained CLIP and T5-XXL as te
 Compared with CLIP, MLLM has demonstrated a superior ability in image detail description
 and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/text_encoder.png" height=275>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/text_encoder.png" height=275>
 </p>
 
 ### **3D VAE**
 HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space, and channel to 4, 8, and 16, respectively. This significantly reduces the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at the original resolution and frame rate.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/3dvae.png" height=150>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/3dvae.png" height=150>
 </p>
 
 ### **Prompt Rewrite**
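Aside from the URL change, the 3D VAE paragraph in this hunk pins down concrete compression ratios: 4x in time, 8x per spatial axis, and 16 latent channels. A quick back-of-the-envelope check of what that buys, using an assumed 129-frame 720x1280 clip and the common causal convention (T - 1)/4 + 1 for latent frames; both are assumptions, not figures from this README:

```python
# Latent size under the stated 4x (time), 8x (space), 16-channel compression.
# The frame formula (T - 1) // 4 + 1 is an assumption about how CausalConv3D
# treats the first frame; the 129x720x1280 input is an example, not from the README.
frames, height, width = 129, 720, 1280

lat_t = (frames - 1) // 4 + 1            # 33 latent frames
lat_h, lat_w = height // 8, width // 8   # 90 x 160 latent grid
lat_c = 16                               # latent channels

positions = lat_t * lat_h * lat_w        # 475,200 latent positions
pixels = frames * height * width         # 118,886,400 input pixels
print(positions, pixels // positions)    # ~250x fewer positions than pixels
```

At roughly 250x fewer latent positions than input pixels, the downstream diffusion transformer sees a dramatically shorter sequence, which is the point the paragraph makes about training at native resolution and frame rate.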