zR committed on
Commit
70c4b01
·
1 Parent(s): 882ba40
Files changed (2)
  1. README.md +265 -37
  2. README_zh.md +252 -0
README.md CHANGED
@@ -1,51 +1,279 @@
1
  ---
2
- frameworks:
3
- - Pytorch
4
  license: other
5
- tasks:
6
- - image-to-video
7
 
8
- #model-type:
9
- ##e.g. gpt, phi, llama, chatglm, baichuan, etc.
10
- #- gpt
11
 
12
- #domain:
13
- ##e.g. nlp, cv, audio, multi-modal
14
- #- nlp
15
 
16
- #language:
17
- ##List of language codes https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
18
- #- cn
19
 
20
- #metrics:
21
- ##e.g. CIDEr, BLEU, ROUGE
22
- #- CIDEr
23
 
24
- #tags:
25
- ##Custom tags, including training methods such as pretrained, fine-tuned, instruction-tuned, RL-tuned, and others
26
- #- pretrained
27
 
28
- #tools:
29
- ##e.g. vllm, fastchat, llamacpp, AdaSeq, etc.
30
- #- vllm
31
- ---
32
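- ### The contributors of this model have not provided a more detailed model introduction. Model files and weights are available on the "Model Files" page.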
33
- #### You can download the model with the git clone command below or with the ModelScope SDK
34
 
35
- SDK download
36
- ```bash
37
- # Install ModelScope
38
- pip install modelscope
39
  ```
40
- ```python
41
- # Download the model with the SDK
42
- from modelscope import snapshot_download
43
- model_dir = snapshot_download('ZhipuAI/CogVideoX-5b-I2V')
44
  ```
45
- Git download
46
  ```
47
- # Download the model with Git
48
- git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-5b-I2V.git
49
  ```
50
 
51
- <p style="color: lightgrey;">If you are a contributor to this model, we invite you to complete the model card promptly by following the <a href="https://modelscope.cn/docs/ModelScope%E6%A8%A1%E5%9E%8B%E6%8E%A5%E5%85%A5%E6%B5%81%E7%A8%8B%E6%A6%82%E8%A7%88" style="color: lightgrey; text-decoration: underline;">model contribution documentation</a>.</p>
1
  ---
2
  license: other
3
+ license_link: https://huggingface.co/THUDM/CogVideoX-5b-I2V/blob/main/LICENSE
4
+ language:
5
+ - en
6
+ tags:
7
+ - cogvideox
8
+ - video-generation
9
+ - thudm
10
+ - text-to-video
11
+ inference: false
12
+ ---
13
 
14
+ # CogVideoX-5B-I2V
15
 
16
+ <p style="text-align: center;">
17
+ <div align="center">
18
+ <img src=https://github.com/THUDM/CogVideo/raw/main/resources/logo.svg width="50%"/>
19
+ </div>
20
+ <p align="center">
21
+ <a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V/blob/main/README.md">📄 Read in English</a> |
22
+ <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space">🤗 Huggingface Space</a> |
23
+ <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
24
+ <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
25
+ </p>
26
+ <p align="center">
27
+ 📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">Qingying</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> for the commercial version of the video generation model
28
+ </p>
29
 
30
+ ## Model Introduction
31
 
32
+ CogVideoX is an open-source video generation model originating
33
+ from [Qingying](https://chatglm.cn/video?fr=osm_cogvideo). The table below presents information related to the video
34
+ generation models we offer in this version.
35
 
36
+ <table style="border-collapse: collapse; width: 100%;">
37
+ <tr>
38
+ <th style="text-align: center;">Model Name</th>
39
+ <th style="text-align: center;">CogVideoX-2B</th>
40
+ <th style="text-align: center;">CogVideoX-5B</th>
41
+ <th style="text-align: center;">CogVideoX-5B-I2V (This Repository)</th>
42
+ </tr>
43
+ <tr>
44
+ <td style="text-align: center;">Model Description</td>
45
+ <td style="text-align: center;">Entry-level model that balances compatibility; low cost to run and to build on.</td>
46
+ <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
47
+ <td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
48
+ </tr>
49
+ <tr>
50
+ <td style="text-align: center;">Inference Precision</td>
51
+ <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td>
52
+ <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
53
+ </tr>
54
+ <tr>
55
+ <td style="text-align: center;">Single GPU Memory Usage<br></td>
56
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB* </b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
57
+ <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB* </b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
58
+ </tr>
59
+ <tr>
60
+ <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
61
+ <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
62
+ <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
63
+ </tr>
64
+ <tr>
65
+ <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
66
+ <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
67
+ <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
68
+ </tr>
69
+ <tr>
70
+ <td style="text-align: center;">Fine-tuning Precision</td>
71
+ <td style="text-align: center;"><b>FP16</b></td>
72
+ <td colspan="2" style="text-align: center;"><b>BF16</b></td>
73
+ </tr>
74
+ <tr>
75
+ <td style="text-align: center;">Fine-tuning Memory Usage</td>
76
+ <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
77
+ <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
78
+ <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
79
+ </tr>
80
+ <tr>
81
+ <td style="text-align: center;">Prompt Language</td>
82
+ <td colspan="3" style="text-align: center;">English*</td>
83
+ </tr>
84
+ <tr>
85
+ <td style="text-align: center;">Maximum Prompt Length</td>
86
+ <td colspan="3" style="text-align: center;">226 Tokens</td>
87
+ </tr>
88
+ <tr>
89
+ <td style="text-align: center;">Video Length</td>
90
+ <td colspan="3" style="text-align: center;">6 Seconds</td>
91
+ </tr>
92
+ <tr>
93
+ <td style="text-align: center;">Frame Rate</td>
94
+ <td colspan="3" style="text-align: center;">8 Frames / Second</td>
95
+ </tr>
96
+ <tr>
97
+ <td style="text-align: center;">Video Resolution</td>
98
+ <td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
99
+ </tr>
100
+ <tr>
+ <td style="text-align: center;">Position Encoding</td>
101
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
102
+ <td style="text-align: center;">3d_rope_pos_embed</td>
103
+ <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
104
+ </tr>
105
+ </table>
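
The video length and frame rate rows in the table determine how many frames a clip contains. As a small worked example (a sketch, not code from the CogVideoX repository): the inference examples in this README pass `num_frames=49` and export at `fps=8`, which is consistent with 6 seconds x 8 frames per second plus one extra frame. Treating that extra frame as a fixed offset of one is an assumption made here, and the helper name `frame_count` is hypothetical.

```python
def frame_count(seconds: int, fps: int, extra_frames: int = 1) -> int:
    """Return seconds * fps + extra_frames.

    extra_frames=1 is an assumption chosen to match num_frames=49
    in the inference examples of this README.
    """
    return seconds * fps + extra_frames


if __name__ == "__main__":
    # 6-second clip at 8 frames per second, as listed in the table.
    print(frame_count(6, 8))
```

Keeping the arithmetic in one place makes it easier to see how `num_frames` and the `fps` passed to `export_to_video` relate to the advertised clip length.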
106
 
107
+ **Data Explanation**
108
+
109
+ + Tests with the diffusers library were run with all of the library's built-in optimizations enabled. Actual memory
110
+ usage with this scheme has not been measured on devices other than **NVIDIA A100 / H100**. In general, the scheme
111
+ should work on all devices with the **NVIDIA Ampere architecture** or newer. If the optimizations are disabled,
112
+ memory consumption multiplies, with peak usage about 3 times the value in the table, but inference runs about
113
+ 3-4 times faster. You can selectively disable some of the optimizations, which include:
114
 
115
  ```
116
+ pipe.enable_sequential_cpu_offload()
117
+ pipe.vae.enable_slicing()
118
+ pipe.vae.enable_tiling()
119
  ```
120
+
121
+ + For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
122
+ + INT8 quantization lets GPUs with less memory run the model with minimal loss in video quality, but it
123
+ significantly slows down inference.
124
+ + The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision.
125
+ We recommend using the precision in which the model was trained for inference.
126
+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
127
+ used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
128
+ allows the model to run on a free T4 Colab or on GPUs with less memory. TorchAO quantization is also fully
129
+ compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision requires NVIDIA H100
130
+ or newer devices and a source installation of the `torch`, `torchao`, `diffusers`, and `accelerate`
131
+ Python packages. CUDA 12.4 is recommended.
132
+ + The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed
133
+ increases by about 10%. Only the `diffusers` version of the model supports quantization.
134
+ + The model only supports English input; prompts in other languages can be translated into English when they are
135
+ refined with a large language model.
136
+ + The memory usage of model fine-tuning is tested in an `8 * H100` environment, and the program automatically
137
+ uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number or more GPUs must be used
138
+ for fine-tuning.
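
The notes above give two rules of thumb: run inference in the precision the model was trained in, and expect roughly 3 times the table's memory figure when the diffusers optimizations are turned off. The snippet below is a minimal sketch that encodes just those two statements. The names `TRAINED_PRECISION`, `recommended_precision`, and `estimated_unoptimized_peak_gb` are hypothetical, and the factor of 3 is the approximate figure quoted above, not a measurement.

```python
# Training precision per model, as stated in the notes above.
TRAINED_PRECISION = {
    "CogVideoX-2B": "FP16",
    "CogVideoX-5B": "BF16",
    "CogVideoX-5B-I2V": "BF16",
}

# "about 3 times the value in the table" when optimizations are disabled.
UNOPTIMIZED_FACTOR = 3.0


def recommended_precision(model_name: str) -> str:
    """Return the precision the model was trained in (recommended for inference)."""
    return TRAINED_PRECISION[model_name]


def estimated_unoptimized_peak_gb(table_value_gb: float) -> float:
    """Rough peak memory estimate with the diffusers optimizations disabled."""
    return table_value_gb * UNOPTIMIZED_FACTOR


if __name__ == "__main__":
    print(recommended_precision("CogVideoX-5B-I2V"))
    # Table value for diffusers BF16 on the 5B models: "from 5GB".
    print(estimated_unoptimized_peak_gb(5))
```

The result is only an order-of-magnitude guide for choosing which optimizations to keep on a given GPU.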
139
+
140
+ **Reminders**
141
+
142
+ + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning SAT version models. Feel free
143
+ to visit our GitHub for more details.
144
+
145
+ ## Getting Started Quickly 🤗
146
+
147
+ This model supports deployment using the Hugging Face diffusers library. You can follow the steps below to get started.
148
+
149
+ **We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) to check out prompt optimization and
150
+ conversion to get a better experience.**
151
+
152
+ 1. Install the required dependencies
153
+
154
+ ```shell
155
+ # diffusers>=0.30.3
156
+ # transformers>=4.45.0
157
+ # accelerate>=0.34.0
158
+ # imageio-ffmpeg>=0.5.1
159
+ pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
160
  ```
161
+
162
+ 2. Run the code (BF16 / FP16)
163
+
164
+ ```python
165
+ import torch
166
+ from diffusers import CogVideoXImageToVideoPipeline
167
+ from diffusers.utils import export_to_video, load_image
168
+
169
+ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
170
+ image = load_image(image="panda.jpg")
171
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
172
+     "THUDM/CogVideoX-5b-I2V",
173
+     torch_dtype=torch.bfloat16
174
+ )
175
+
176
+ pipe.enable_sequential_cpu_offload()
177
+ pipe.vae.enable_tiling()
178
+ pipe.vae.enable_slicing()
179
+
180
+ video = pipe(
181
+     prompt=prompt,
182
+     image=image,
183
+     num_videos_per_prompt=1,
184
+     num_inference_steps=50,
185
+     num_frames=49,
186
+     guidance_scale=6,
187
+     generator=torch.Generator(device="cuda").manual_seed(42),
188
+ ).frames[0]
189
+
190
+ export_to_video(video, "output.mp4", fps=8)
191
  ```
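
Before running the pipeline, it can help to confirm that the installed packages meet the minimum versions listed in step 1. The following is a minimal sketch using only the standard library. The `MINIMUMS` table and the helper names are hypothetical, and the `transformers` floor is written as 4.45.0, which is an assumption about the intended value of that comment line.

```python
from importlib import metadata

# Minimum versions from the install step. The transformers floor is written
# here as 4.45.0, which is an assumption about the intended value.
MINIMUMS = {
    "diffusers": "0.30.3",
    "transformers": "4.45.0",
    "accelerate": "0.34.0",
    "imageio-ffmpeg": "0.5.1",
}


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '0.31.0.dev0' -> (0, 31, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric release segments."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment() -> dict:
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(name, ok)
```

The parser deliberately ignores pre-release and local suffixes, so it is a coarse check rather than a full implementation of Python version ordering.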
192
 
193
+ ## Quantized Inference
194
+
195
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
196
+ used to quantize the text encoder, transformer, and VAE modules to reduce CogVideoX's memory requirements. This allows
197
+ the model to run on free T4 Colab or GPUs with lower VRAM! Also, note that TorchAO quantization is fully compatible
198
+ with `torch.compile`, which can significantly accelerate inference.
199
+
200
+ ```python
201
+ # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
202
+ # Source and nightly installation is only required until the next release.
203
+
204
+ import torch
205
+ from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXImageToVideoPipeline
206
+ from diffusers.utils import export_to_video, load_image
207
+ from transformers import T5EncoderModel
208
+ from torchao.quantization import quantize_, int8_weight_only
209
+
210
+ quantization = int8_weight_only
211
+
212
+ text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="text_encoder", torch_dtype=torch.bfloat16)
213
+ quantize_(text_encoder, quantization())
214
+
215
+ transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16)
216
+ quantize_(transformer, quantization())
217
+
218
+ vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="vae", torch_dtype=torch.bfloat16)
219
+ quantize_(vae, quantization())
220
+
221
+ # Create pipeline and run inference
222
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
223
+     "THUDM/CogVideoX-5b-I2V",
224
+     text_encoder=text_encoder,
225
+     transformer=transformer,
226
+     vae=vae,
227
+     torch_dtype=torch.bfloat16,
228
+ )
229
+
230
+ pipe.enable_model_cpu_offload()
231
+ pipe.vae.enable_tiling()
232
+ pipe.vae.enable_slicing()
233
+
234
+ prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
235
+ image = load_image(image="input.jpg")
236
+ video = pipe(
237
+     prompt=prompt,
238
+     image=image,
239
+     num_videos_per_prompt=1,
240
+     num_inference_steps=50,
241
+     num_frames=49,
242
+     guidance_scale=6,
243
+     generator=torch.Generator(device="cuda").manual_seed(42),
244
+ ).frames[0]
245
+
246
+ export_to_video(video, "output.mp4", fps=8)
247
+ ```
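
To see why INT8 weight-only quantization lowers memory requirements, it helps to estimate weight storage from the parameter count and the bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement: the nominal count of 5 billion parameters is an assumption taken from the "5B" in the model name, the helper `weight_gib` is hypothetical, and activations and quantization scales are ignored.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_gib(num_params: int, dtype: str) -> float:
    """Approximate weight storage in GiB; ignores activations and quantization scales."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 5_000_000_000  # assumption: nominal "5B", not a measured count
    for dtype in ("bf16", "int8"):
        print(dtype, round(weight_gib(nominal_params, dtype), 2))
```

Under these assumptions, INT8 weights take half the space of BF16 weights for the same module, which is in line with the lower memory floor reported for the INT8 (torchao) configuration in the table.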
248
+
249
+ Additionally, these models can be serialized and stored using PytorchAO in quantized data types to save disk space. You
250
+ can find examples and benchmarks at the following links:
251
+
252
+ - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
253
+ - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
254
+
255
+ ## Further Exploration
256
+
257
+ Feel free to visit our [GitHub](https://github.com/THUDM/CogVideo), where you'll find:
258
+
259
+ 1. More detailed technical explanations and code.
260
+ 2. Optimized prompt examples and conversions.
261
+ 3. Detailed code for model inference and fine-tuning.
262
+ 4. Project update logs and more interactive opportunities.
263
+ 5. CogVideoX toolchain to help you better use the model.
264
+ 6. INT8 model inference code.
265
+
266
+ ## Model License
267
+
268
+ This model is released under the [CogVideoX LICENSE](LICENSE).
269
+
270
+ ## Citation
271
+
272
+ ```
273
+ @article{yang2024cogvideox,
274
+ title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
275
+ author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
276
+ journal={arXiv preprint arXiv:2408.06072},
277
+ year={2024}
278
+ }
279
+ ```
README_zh.md ADDED
@@ -0,0 +1,252 @@
1
+ # CogVideoX-5B-I2V
2
+
3
+ <p style="text-align: center;">
4
+ <div align="center">
5
+ <img src=https://github.com/THUDM/CogVideo/raw/main/resources/logo.svg width="50%"/>
6
+ </div>
7
+ <p align="center">
8
+ <a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V/blob/main/README.md">📄 Read in English</a> |
9
+ <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space">🤗 Huggingface Space</a> |
10
+ <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
11
+ <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
12
+ </p>
13
+ <p align="center">
14
+ 📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">Qingying</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> for the commercial version of the video generation model
15
+ </p>
16
+
17
+ ## Model Introduction
18
+
19
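+ CogVideoX is an open-source video generation model that shares its origin with [Qingying](https://chatglm.cn/video?fr=osm_cogvideo). The table below lists information about the video generation models we offer in this generation: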
20
+
21
+ <table style="border-collapse: collapse; width: 100%;">
22
+ <tr>
23
+ <th style="text-align: center;">Model Name</th>
24
+ <th style="text-align: center;">CogVideoX-2B</th>
25
+ <th style="text-align: center;">CogVideoX-5B</th>
26
+ <th style="text-align: center;">CogVideoX-5B-I2V (This Repository)</th>
27
+ </tr>
28
+ <tr>
29
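+ <td style="text-align: center;">Model Description</td>
30
+ <td style="text-align: center;">Entry-level model that balances compatibility; low cost to run and to build on.</td>
31
+ <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
32
+ <td style="text-align: center;">Image-to-video version of CogVideoX-5B.</td>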
33
+ </tr>
34
+ <tr>
35
+ <td style="text-align: center;">Inference Precision</td>
36
+ <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td>
37
+ <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
38
+ </tr>
39
+ <tr>
40
+ <td style="text-align: center;">Single GPU Memory Usage<br></td>
41
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB* </b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
42
+ <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB* </b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
43
+ </tr>
44
+ <tr>
45
+ <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
46
+ <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
47
+ <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
48
+ </tr>
49
+ <tr>
50
+ <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
51
+ <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
52
+ <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
53
+ </tr>
54
+ <tr>
55
+ <td style="text-align: center;">Fine-tuning Precision</td>
56
+ <td style="text-align: center;"><b>FP16</b></td>
57
+ <td colspan="2" style="text-align: center;"><b>BF16</b></td>
58
+ </tr>
59
+ <tr>
60
+ <td style="text-align: center;">Fine-tuning Memory Usage</td>
61
+ <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
62
+ <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
63
+ <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
64
+ </tr>
65
+ <tr>
66
+ <td style="text-align: center;">Prompt Language</td>
67
+ <td colspan="3" style="text-align: center;">English*</td>
68
+ </tr>
69
+ <tr>
70
+ <td style="text-align: center;">Maximum Prompt Length</td>
71
+ <td colspan="3" style="text-align: center;">226 Tokens</td>
72
+ </tr>
73
+ <tr>
74
+ <td style="text-align: center;">Video Length</td>
75
+ <td colspan="3" style="text-align: center;">6 Seconds</td>
76
+ </tr>
77
+ <tr>
78
+ <td style="text-align: center;">Frame Rate</td>
79
+ <td colspan="3" style="text-align: center;">8 Frames / Second</td>
80
+ </tr>
81
+ <tr>
82
+ <td style="text-align: center;">Video Resolution</td>
83
+ <td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
84
+ </tr>
85
+ <tr>
86
+ <td style="text-align: center;">Position Encoding</td>
87
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
88
+ <td style="text-align: center;">3d_rope_pos_embed</td>
89
+ <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
90
+ </tr>
91
+ </table>
92
+
93
+ **Data Explanation**
94
+
95
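+ + Tests with the diffusers library were run with all of the optimizations built into the `diffusers` library enabled. Actual GPU / system memory usage with this scheme has not been measured on devices other than **NVIDIA A100 / H100**.
96
+ In general, the scheme should work on all devices with the **NVIDIA Ampere architecture** or newer.
97
+ If the optimizations are disabled, memory consumption multiplies, with peak usage about 3 times the value in the table, but inference runs about 3-4 times faster. You can selectively disable some of the optimizations, which include: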
98
+
99
+ ```
100
+ pipe.enable_sequential_cpu_offload()
101
+ pipe.vae.enable_slicing()
102
+ pipe.vae.enable_tiling()
103
+ ```
104
+
105
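+ + For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
106
+ + INT8 quantization lets GPUs with less memory run the model with minimal loss in video quality, but it significantly slows down inference.
107
+ + The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision. We recommend using the precision in which the model was trained for inference.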
108
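+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
109
+ can be used to quantize the text encoder, Transformer, and VAE modules to reduce the memory requirements of CogVideoX. This makes it possible to run the model on a free T4 Colab or on GPUs with less memory.
110
+ TorchAO quantization is also fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8` precision requires `NVIDIA H100`
111
+ or newer devices and a source installation of the `torch`, `torchao`, `diffusers`, and `accelerate` Python
112
+ packages. `CUDA 12.4` is recommended.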
113
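+ + The inference speed tests also used the memory optimization scheme above. Without memory optimization, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
114
+ + The model only supports English input; prompts in other languages can be translated into English when they are refined with a large language model.
115
+ + Fine-tuning memory usage was measured in an `8 * H100` environment, and the program automatically uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number of GPUs or more must be used for fine-tuning.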
116
+
117
+ **Reminders**
118
+
119
+ + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version models. Feel free to visit our GitHub for more details.
120
+
121
+ ## Getting Started Quickly 🤗
122
+
123
+ This model supports deployment with the Hugging Face diffusers library. You can follow the steps below to deploy it.
124
+
125
+ **We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) and check out the related prompt optimization and conversion for a better experience.**
126
+
127
+ 1. Install the required dependencies
128
+
129
+ ```shell
130
+ # diffusers>=0.30.3
131
+ # transformers>=4.45.0
132
+ # accelerate>=0.34.0
133
+ # imageio-ffmpeg>=0.5.1
134
+ pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
135
+ ```
136
+
137
+ 2. Run the code (BF16 / FP16)
138
+
139
+ ```python
140
+ import torch
141
+ from diffusers import CogVideoXImageToVideoPipeline
142
+ from diffusers.utils import export_to_video, load_image
143
+
144
+ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
145
+ image = load_image(image="panda.jpg")
146
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
147
+     "THUDM/CogVideoX-5b-I2V",
148
+     torch_dtype=torch.bfloat16
149
+ )
150
+
151
+ pipe.enable_sequential_cpu_offload()
152
+ pipe.vae.enable_tiling()
153
+ pipe.vae.enable_slicing()
154
+
155
+ video = pipe(
156
+     prompt=prompt,
157
+     image=image,
158
+     num_videos_per_prompt=1,
159
+     num_inference_steps=50,
160
+     num_frames=49,
161
+     guidance_scale=6,
162
+     generator=torch.Generator(device="cuda").manual_seed(42),
163
+ ).frames[0]
164
+
165
+ export_to_video(video, "output.mp4", fps=8)
166
+ ```
167
+
168
+ ## Quantized Inference
169
+
170
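+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
171
+ can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on
172
+ GPUs with lower VRAM. TorchAO quantization is also fully compatible with `torch.compile`, which can significantly accelerate inference.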
173
+
174
+ ```python
175
+ # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
176
+ # Source and nightly installation is only required until the next release.
177
+
178
+ import torch
179
+ from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXImageToVideoPipeline
180
+ from diffusers.utils import export_to_video, load_image
181
+ from transformers import T5EncoderModel
182
+ from torchao.quantization import quantize_, int8_weight_only
183
+
184
+ quantization = int8_weight_only
185
+
186
+ text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="text_encoder", torch_dtype=torch.bfloat16)
187
+ quantize_(text_encoder, quantization())
188
+
189
+ transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16)
190
+ quantize_(transformer, quantization())
191
+
192
+ vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b-I2V", subfolder="vae", torch_dtype=torch.bfloat16)
193
+ quantize_(vae, quantization())
194
+
195
+ # Create pipeline and run inference
196
+ pipe = CogVideoXImageToVideoPipeline.from_pretrained(
197
+     "THUDM/CogVideoX-5b-I2V",
198
+     text_encoder=text_encoder,
199
+     transformer=transformer,
200
+     vae=vae,
201
+     torch_dtype=torch.bfloat16,
202
+ )
203
+
204
+ pipe.enable_model_cpu_offload()
205
+ pipe.vae.enable_tiling()
206
+ pipe.vae.enable_slicing()
207
+
208
+ prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
209
+ image = load_image(image="input.jpg")
210
+ video = pipe(
211
+     prompt=prompt,
212
+     image=image,
213
+     num_videos_per_prompt=1,
214
+     num_inference_steps=50,
215
+     num_frames=49,
216
+     guidance_scale=6,
217
+     generator=torch.Generator(device="cuda").manual_seed(42),
218
+ ).frames[0]
219
+
220
+ export_to_video(video, "output.mp4", fps=8)
221
+ ```
222
+
223
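+ Additionally, these models can be serialized and stored in quantized data types using PytorchAO to save disk space. You can find examples and benchmarks at the following links.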
224
+
225
+ - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
226
+ - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
227
+
228
+ ## Further Exploration
229
+
230
+ Feel free to visit our [GitHub](https://github.com/THUDM/CogVideo), where you'll find:
231
+
232
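+ 1. More detailed technical explanations and code walkthroughs.
233
+ 2. Prompt optimization and conversion.
234
+ 3. Detailed code for model inference and fine-tuning.
235
+ 4. Project update logs and more interactive opportunities.
236
+ 5. CogVideoX toolchain to help you better use the model.
237
+ 6. INT8 model inference code.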
238
+
239
+ ## Model License
240
+
241
+ This model is released under the [CogVideoX LICENSE](LICENSE).
242
+
243
+ ## Citation
244
+
245
+ ```
246
+ @article{yang2024cogvideox,
247
+ title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
248
+ author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
249
+ journal={arXiv preprint arXiv:2408.06072},
250
+ year={2024}
251
+ }
252
+ ```