<!-- ## **HunyuanDiT** -->
<!-- [[Technical Report]()] &emsp; [[Project Page]()] &emsp; [[Model Card]()] <br>

[[🤗 Demo (Realistic)]()] &emsp; -->
<p align="center">
  <img src="./asset/logo.png" height=100>
</p>

<div align="center" style="font-size: 30px;font-weight: bold;">Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding</div>

<div align="center">
  <a href="https://github.com/Tencent/HunyuanDiT"><img src="https://img.shields.io/static/v1?label=Hunyuan-DiT Code&message=Github&color=blue&logo=github-pages"></a> &ensp;
  <a href="https://dit.hunyuan.tencent.com"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a> &ensp;
  <a href="https://arxiv.org/abs/"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:HunYuan-DiT&color=red&logo=arxiv"></a> &ensp;
  <a href="https://arxiv.org/abs/2403.08857"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:DialogGen&color=red&logo=arxiv"></a> &ensp;
  <a href="https://huggingface.co/Tencent-Hunyuan/Hunyuan-DiT"><img src="https://img.shields.io/static/v1?label=Hunyuan-DiT&message=HuggingFace&color=yellow"></a> &ensp;
</div>


<!-- ## Contents
* [Dependencies and Installation](#-Dependencies-and-Installation)
* [Inference](#-Inference)
* [Download Models](#-download-models)

* [Acknowledgement](#acknowledgements)
* [Citation](#bibtex) -->

# **Abstract**

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-round multi-modal dialogue with users, generating and refining images according to the context.
Through our carefully designed holistic evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models.


# **Hunyuan-DiT Key Features**
## **Chinese-English Bilingual DiT Architecture**
Hunyuan-DiT is a text-to-image generation model based on a diffusion transformer, with fine-grained understanding of both Chinese and English. To build Hunyuan-DiT, we carefully designed the transformer structure, the text encoder, and the positional encoding. We also built a complete data pipeline from scratch to update and evaluate data, supporting iterative model optimization. To achieve fine-grained text understanding, we train a multimodal large language model to refine the text descriptions of the images. As a result, Hunyuan-DiT can conduct multiple rounds of dialogue with users, generating and improving images based on the context.
<p align="center">
  <img src="./asset/framework.png" height=500>
</p>

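The model table under **Download Models** below lists both a CLIP text encoder and an mT5 encoder among the components. The following is a conceptual sketch only, not Hunyuan-DiT's actual code: it encodes one bilingual prompt with two such encoders, using public checkpoints as stand-ins, and deliberately omits the projection/fusion step a real model would need.

```python
# Conceptual sketch (not the official implementation): encode one bilingual
# prompt with a CLIP-style text encoder and an mT5 encoder. The checkpoints
# below are public stand-ins for the encoders listed in the model table.
import torch
from transformers import AutoTokenizer, CLIPTextModel, MT5EncoderModel

clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-small")
mt5_enc = MT5EncoderModel.from_pretrained("google/mt5-small")

prompt = "A fishing boat on a misty lake at dusk, 渔舟唱晚"

with torch.no_grad():
    clip_feats = clip_enc(**clip_tok(prompt, return_tensors="pt", truncation=True)).last_hidden_state
    mt5_feats = mt5_enc(**mt5_tok(prompt, return_tensors="pt")).last_hidden_state

# The two encoders produce different hidden sizes, so a real model would
# project both to a common width before the DiT consumes them; omitted here.
print(clip_feats.shape, mt5_feats.shape)
```
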
## **Multi-turn Text2Image Generation**
Understanding natural language instructions and performing multi-turn interactions with users are important for a text-to-image system. They help build a dynamic and iterative creation process that brings the user's idea into reality step by step. In this section, we detail how we empower Hunyuan-DiT with the ability to perform multi-round conversation and image generation: we train an MLLM to understand the multi-round user dialogue and output a new text prompt for image generation, as sketched in the example below.
<p align="center">
  <img src="./asset/mllm.png" height=300>
</p>

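A minimal sketch of this loop follows. Both helpers are placeholder stubs standing in for the MLLM (e.g. DialogGen) and the image generator; `mllm_rewrite` and `generate_image` are illustrative names, not functions from this repository.

```python
# Illustrative sketch of the multi-turn flow: an MLLM condenses the running
# dialogue into one self-contained prompt per turn, which drives image generation.

def mllm_rewrite(history):
    # Placeholder stub: the real system would ask the MLLM (DialogGen) to read
    # the whole dialogue and emit a fresh, self-contained drawing prompt.
    return " ; ".join(turn["content"] for turn in history if turn["role"] == "user")

def generate_image(prompt):
    # Placeholder stub: the real system would run Hunyuan-DiT sampling here.
    return f"<image generated from: {prompt!r}>"

history = []

def chat_turn(user_message):
    history.append({"role": "user", "content": user_message})
    prompt = mllm_rewrite(history)      # new prompt from the dialogue context
    image = generate_image(prompt)      # text-to-image with the rewritten prompt
    history.append({"role": "assistant", "content": prompt})
    return image

print(chat_turn("Draw a fishing boat on a lake at dusk"))
print(chat_turn("Make the sky more golden and add distant mountains"))
```
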
## **Comparisons**
To comprehensively compare the generation capabilities of Hunyuan-DiT and other models, we constructed a 4-dimensional test set covering text-image consistency, excluding AI artifacts, subject clarity, and aesthetics. More than 50 professional evaluators performed the evaluation.

<p align="center">
<table>
<thead>
<tr>
    <th>Type</th> <th>Model</th> <th>Text-Image Consistency (%)</th> <th>Excluding AI Artifacts (%)</th> <th>Subject Clarity (%)</th> <th>Aesthetics (%)</th> <th>Overall (%)</th>
</tr>
</thead>
<tbody>
<tr>
    <td rowspan="3">Open Source</td>
    <td>SDXL</td> <td>64.3</td> <td>60.6</td> <td>91.1</td> <td>76.3</td> <td>42.7</td>
</tr>
<tr>
    <td>Playground 2.5</td> <td>71.9</td> <td>70.8</td> <td>94.9</td> <td>83.3</td> <td>54.3</td>
</tr>
<tr style="font-weight: bold; background-color: #f2f2f2;">
    <td>Hunyuan-DiT</td> <td>74.2</td> <td>74.3</td> <td>95.4</td> <td>86.6</td> <td>59.0</td>
</tr>
<tr>
    <td rowspan="3">Closed Source</td>
    <td>SD 3</td> <td>77.1</td> <td>69.3</td> <td>94.6</td> <td>82.5</td> <td>56.7</td>
</tr>
<tr>
    <td>Midjourney v6</td> <td>73.5</td> <td>80.2</td> <td>93.5</td> <td>87.2</td> <td>63.3</td>
</tr>
<tr>
    <td>DALL-E 3</td> <td>83.9</td> <td>80.3</td> <td>96.5</td> <td>89.4</td> <td>71.0</td>
</tr>
</tbody>
</table>
</p>

## **Visualization**

* **Chinese Elements**
<p align="center">
  <img src="./asset/chinese elements understanding.png" height=280>
</p>

* **Long Text Input**

<p align="center">
  <img src="./asset/long text understanding.png" height=900>
  <figcaption>Comparison between Hunyuan-DiT and other text-to-image models. The image with the highest resolution on the far left is the result of Hunyuan-DiT. The others, from top left to bottom right, are: DALL-E 3, Midjourney v6, SD 3, Playground 2.5, PixArt, SDXL, Baidu Yige, and WanXiang.</figcaption>
</p>

* **Multi-turn Text2Image Generation**
<p align="center">
  <a href="https://prc-videoframe-pub-1258344703.cos.ap-guangzhou.myqcloud.com/ad_creative_engine/projectpage/1deab38689342431e63606e01e16961c.mov">
    <img src="./asset/cover.png" alt="Watch the video" height="800">
  </a>
</p>

# **Dependencies and Installation**
Ensure your machine is equipped with a GPU that has more than 20 GB of memory.

Begin by cloning the repository:
```bash
git clone https://github.com/tencent/HunyuanDiT
cd HunyuanDiT
```

We provide an `environment.yml` file for setting up a Conda environment.
Installation instructions for Conda are available [here](https://docs.anaconda.com/free/miniconda/index.html).

```shell
# Prepare the conda environment
conda env create -f environment.yml

# Activate the environment
conda activate HunyuanDiT

# Install pip dependencies
python -m pip install -r requirements.txt

# Install Flash Attention v2 for acceleration (requires CUDA 11.6 or above)
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3
```
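
Optionally, you can confirm that a suitable GPU is visible before downloading the checkpoints. The check below is our suggestion, not part of the official setup:

```python
# Optional sanity check (not part of the official instructions): confirm that
# a CUDA GPU with more than 20 GB of memory is visible to PyTorch.
import torch

assert torch.cuda.is_available(), "No CUDA GPU visible"
total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gib:.1f} GiB")
assert total_gib > 20, "Hunyuan-DiT needs a GPU with more than 20 GB of memory"
```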

# **Download Models**
To download the model, first install the huggingface-cli. Installation instructions are available [here](https://huggingface.co/docs/huggingface_hub/guides/cli):

```sh
# Create a directory named 'ckpts' where the model will be saved, fulfilling the prerequisites for running the demo.
mkdir ckpts
# Use the huggingface-cli tool to download the model.
# The download time may vary from 10 minutes to 1 hour depending on network conditions.
huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts
```
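
If you prefer to script the download, the same repository can be fetched from Python with `huggingface_hub`. This is an equivalent alternative to the CLI command above, not an additional step:

```python
# Python alternative to the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Tencent-Hunyuan/HunyuanDiT", local_dir="./ckpts")
```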
<!-- For more information about the model, visit the Hugging Face repository [here](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT). -->

All models will be downloaded automatically. For more information about the model, visit the Hugging Face repository [here](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT).

| Model             | #Params | URL |
|:------------------|:--------|:----|
| mT5               | xxB     | [mT5](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/mt5) |
| CLIP              | xxB     | [CLIP](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/clip_text_encoder) |
| DialogGen         | 7B      | [DialogGen](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/dialoggen) |
| sdxl-vae-fp16-fix | xxB     | [sdxl-vae-fp16-fix](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/sdxl-vae-fp16-fix) |
| Hunyuan-DiT       | xxB     | [Hunyuan-DiT](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/tree/main/t2i/model) |

# **Inference**
```bash
# Prompt enhancement + text-to-image, torch mode
# ("渔舟唱晚" roughly translates to "fishing boats singing at dusk")
python sample_t2i.py --prompt "渔舟唱晚"

# Disable prompt enhancement, torch mode
python sample_t2i.py --prompt "渔舟唱晚" --no-enhance

# Disable prompt enhancement, Flash Attention mode
python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚"
```
More example prompts can be found in [example_prompts.txt](example_prompts.txt).

Note: sampling on a single GPU uses about 20 GB of GPU memory.
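
This model card is also tagged Diffusers. If your installed `diffusers` version ships a `HunyuanDiTPipeline`, inference can additionally run through that API. The sketch below assumes such a pipeline and a Diffusers-format weight repository; verify the exact repo id on the Hugging Face Hub:

```python
# Hedged sketch: requires a diffusers release that includes HunyuanDiTPipeline.
# The repo id below is an assumption; check the Hugging Face Hub before use.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
pipe.to("cuda")

# Same example prompt as above; "渔舟唱晚" ~ "fishing boats singing at dusk".
image = pipe(prompt="渔舟唱晚").images[0]
image.save("sample.png")
```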


<!-- # **To-Do List**

- [x] Inference code
- [ ] Provide TensorRT engine -->


# **BibTeX**
If you find Hunyuan-DiT useful for your research and applications, please cite it using this BibTeX:

```BibTeX
@inproceedings{,
  title={},
  author={},
  booktitle={},
  year={2024}
}
```