Update README.md
README.md
CHANGED
@@ -71,7 +71,7 @@ using a large language model, and used as the condition. Gaussian noise and cond
 input, our generate model generates an output latent, which is decoded to images or videos through
 the 3D VAE decoder.
 <p align="center">
-<img src="https://
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/overall.png" height=300>
 </p>
 
 ## 🎉 **HunyuanVideo Key Features**
@@ -83,7 +83,7 @@ tokens and feed them into subsequent Transformer blocks for effective multimodal
 This design captures complex interactions between visual and semantic information, enhancing
 overall model performance.
 <p align="center">
-<img src="https://
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/backbone.png" height=350>
 </p>
 
 ### **MLLM Text Encoder**
@@ -91,13 +91,13 @@ Some previous text-to-video model typically use pretrainednCLIP and T5-XXL as te
 Compared with CLIP, MLLM has been demonstrated superior ability in image detail description
 and complex reasoning; (iii) MLLM can play as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention while T5-XXL utilizes bidirectional attention that produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner for enhacing text features.
 <p align="center">
-<img src="https://
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/text_encoder.png" height=275>
 </p>
 
 ### **3D VAE**
 HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space and channel to 4, 8 and 16 respectively. This can significantly reduce the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at the original resolution and frame rate.
 <p align="center">
-<img src="https://
+<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/3dvae.png" height=150>
 </p>
 
 ### **Prompt Rewrite**
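
Note on the 3D VAE paragraph in the last hunk: the stated ratios (4 for video length, 8 for space, 16 for channel) can be turned into a quick shape calculation. The sketch below is illustrative only and is not taken from the repository. It rests on two assumptions that the README text does not spell out: that the causal encoder keeps the first frame on its own, so a clip of `1 + 4k` frames maps to `1 + k` latent frames, and that "16" is the number of latent channels. The function name `latent_shape` and the example clip size are hypothetical.

```python
def latent_shape(frames, height, width,
                 t_ratio=4, s_ratio=8, latent_channels=16):
    """Estimate the latent shape (frames, channels, height, width).

    Assumptions (not confirmed by the README text):
      * the first frame is encoded on its own, so 1 + 4k input frames
        become 1 + k latent frames;
      * the "channel" figure of 16 is the latent channel count.
    """
    if frames < 1 or (frames - 1) % t_ratio != 0:
        raise ValueError("frames must be of the form 1 + t_ratio * k")
    if height % s_ratio != 0 or width % s_ratio != 0:
        raise ValueError("height and width must be multiples of s_ratio")
    return ((frames - 1) // t_ratio + 1,
            latent_channels,
            height // s_ratio,
            width // s_ratio)


def position_reduction(frames, height, width):
    """Ratio of pixel-space positions to latent positions (channels ignored)."""
    lat_t, _, lat_h, lat_w = latent_shape(frames, height, width)
    return (frames * height * width) / (lat_t * lat_h * lat_w)


if __name__ == "__main__":
    # Illustrative clip size chosen for the example, not quoted from the diff.
    print(latent_shape(129, 720, 1280))
    print(round(position_reduction(129, 720, 1280), 1))
```

Under these assumptions, the number of spatio-temporal positions handed to the diffusion transformer shrinks by a factor approaching 4 × 8 × 8 = 256 for long clips, which is the token reduction the paragraph refers to.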