zR committed d2b1ebe
Parent: d16a569

update gpu memory cost

Files changed: README.md (+44, -14) · README_zh.md (+19, -6)
README.md
CHANGED

@@ -23,6 +23,9 @@ inference: false
 <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
 <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
 </p>
+<p align="center">
+📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience commercial video generation models.
+</p>

 ## Demo Show

@@ -109,7 +112,9 @@ inference: false

 ## Model Introduction

-CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.
+CogVideoX is an open-source version of the video generation model originating
+from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation
+models we currently offer, along with their foundational information.

 <table style="border-collapse: collapse; width: 100%;">
 <tr>
@@ -128,9 +133,9 @@ CogVideoX is an open-source version of the video generation model originating fr
 <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
 </tr>
 <tr>
-<td style="text-align: center;">Single GPU VRAM Consumption
-<td style="text-align: center;"
-<td style="text-align: center;"
+<td style="text-align: center;">Single GPU VRAM Consumption<br></td>
+<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB*</b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
+<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: starting from 5GB*</b><br><b>diffusers INT8(torchao): starting from 4.4GB*</b></td>
 </tr>
 <tr>
 <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -181,13 +186,34 @@ CogVideoX is an open-source version of the video generation model originating fr

 **Data Explanation**

-
-
-
-
-
-
-
++ When testing using the `diffusers` library, all optimizations provided by the `diffusers` library were enabled. This
+  solution has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100 / H100**. Generally,
+  this solution can be adapted to all devices with **NVIDIA Ampere architecture** and above. If the optimizations are
+  disabled, VRAM usage will increase significantly, with peak VRAM usage about 3 times the values in the table, while
+  speed will increase by 3-4 times. You can selectively disable some optimizations, including:
+
+```
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+```
+
++ When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
++ Using INT8 models reduces inference speed. This is to ensure that GPUs with lower VRAM can perform inference
+  normally while maintaining minimal video quality loss, though inference speed decreases significantly.
++ The 2B model is trained with `FP16` precision, and the 5B model is trained with `BF16` precision. We recommend using
+  the precision the model was trained with for inference.
++ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+  used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes
+  it possible to run the model on a free T4 Colab or GPUs with smaller VRAM! It is also worth noting that TorchAO
+  quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8`
+  precision must be used on devices with `NVIDIA H100` or above, which requires installing
+  the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
++ The inference speed tests also used the above VRAM optimization scheme. Without VRAM optimization, inference speed
+  increases by about 10%. Only the `diffusers` version of the model supports quantization.
++ The model only supports English input; other languages can be translated into English during refinement by a large
+  model.

 **Note**

@@ -242,7 +268,10 @@ export_to_video(video, "output.mp4", fps=8)

 ## Quantized Inference

-[PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX.
+[PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This makes
+it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that TorchAO
+quantization is fully compatible with `torch.compile`, which allows for much faster inference speed.

 ```diff
 # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
@@ -290,11 +319,12 @@ video = pipe(
 export_to_video(video, "output.mp4", fps=8)
 ```

-Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO. Find examples and benchmarks at these links:
+Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO.
+Find examples and benchmarks at these links:
+
 - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
 - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

-
 ## Explore the Model

 Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
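For reference, the VRAM optimizations this commit documents in **Data Explanation** combine as in the minimal sketch below. It assumes the `diffusers` CogVideoX pipeline; the model id, prompt, and sampler settings are illustrative placeholders, not part of the commit:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Illustrative model id; the 5B model is trained in BF16, so it is loaded in
# the precision it was trained with.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Optimizations listed in the Data Explanation section. Model offload keeps
# only the component currently in use on the GPU; enable_sequential_cpu_offload()
# is the more aggressive, slower alternative, so use one or the other, and
# disable model offload entirely for multi-GPU inference.
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()  # decode the latent batch one slice at a time
pipe.vae.enable_tiling()   # decode each frame in tiles to cap peak VRAM

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest.",  # placeholder prompt
    num_inference_steps=50,
    guidance_scale=6,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Disabling the offload and VAE calls trades the VRAM savings shown in the table for the roughly 3-4x speedup the note describes.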
README_zh.md
CHANGED

@@ -10,6 +10,9 @@
 <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
 <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
 </p>
+<p align="center">
+📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience the commercial video generation model
+</p>

 ## Demo Show

@@ -116,8 +119,8 @@ CogVideoX is the open-source model of the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)
 </tr>
 <tr>
 <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
-<td style="text-align: center;"
-<td style="text-align: center;"
+<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB*</b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
+<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: starting from 5GB*</b><br><b>diffusers INT8(torchao): starting from 4.4GB*</b></td>
 </tr>
 <tr>
 <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -168,13 +171,23 @@ CogVideoX is the open-source model of the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)

 **Data Explanation**

-+ When testing with diffusers
-
-
++ When testing with the `diffusers` library, all optimizations that ship with the `diffusers` library were enabled. This
+  scheme has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100 / H100**; generally, it
+  can be adapted to all devices of the **NVIDIA Ampere architecture** and above. With the optimizations disabled, VRAM
+  usage multiplies, with peak VRAM about 3 times the values in the table, while speed increases by roughly 3-4 times.
+  You can selectively disable some of the optimizations, including:
+```
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+```
+
 + When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
 + Using the INT8 model reduces inference speed; this is so that GPUs with lower VRAM can run inference normally while keeping video quality loss small, though inference speed drops significantly.
 + The 2B model is trained in `FP16` precision and the 5B model in `BF16` precision; we recommend running inference in the precision the model was trained with.
-+
++ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
+  can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements,
+  making it possible to run the model on a free T4 Colab or on GPUs with smaller VRAM. It is also worth noting that
+  TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8`
+  precision must be used on `NVIDIA H100` and above, which requires installing the `torch`, `torchao`, `diffusers`,
+  and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
 + The inference speed tests also used the above VRAM optimization scheme; without VRAM optimization, inference speed is about 10% higher. Only the `diffusers` version of the model supports quantization.
 + The model only supports English input; other languages can be translated into English by a large model during prompt refinement.

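The **Quantized Inference** text added in both READMEs names TorchAO INT8 quantization and its `torch.compile` compatibility. Below is a minimal sketch under those assumptions; the model id is illustrative, and `quantize_` / `int8_weight_only` are assumed from a recent `torchao` release:

```python
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# INT8 weight-only quantization of the transformer lowers VRAM at some cost in
# inference speed (see Data Explanation); the text encoder and VAE can be
# quantized the same way.
quantize_(pipe.transformer, int8_weight_only())
pipe.to("cuda")

# TorchAO quantization composes with torch.compile for faster inference.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

video = pipe("A panda playing a guitar in a bamboo forest.").frames[0]
```

For serializing the quantized weights to disk, see the torchao and quanto gists linked above.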