FrancisRing committed on
Commit
18f9c18
•
1 Parent(s): a4d5753

Update README.md

Files changed (1)
  1. README.md +71 -50
README.md CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
 ---
 # StableAnimator
 
- <a href='https://francis-rings.github.io/StableAnimator'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2411.17697'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/StableAnimator/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=7fwFyFDzQgg'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a>
 
 StableAnimator: High-Quality Identity-Preserving Human Image Animation
 <br/>
@@ -12,21 +12,21 @@ StableAnimator: High-Quality Identity-Preserving Human Image Animation
 [<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Huya Inc; <sup>4</sup>Carnegie Mellon University]
 
 <p align="center">
- <img src="assets/figures/case-47.gif" width="512" />
- <img src="assets/figures/case-61.gif" width="512" />
- <img src="assets/figures/case-45.gif" width="512" />
- <img src="assets/figures/case-46.gif" width="512" />
- <img src="assets/figures/case-5.gif" width="512" />
- <img src="assets/figures/case-17.gif" width="512" />
 <br/>
 <span>Pose-driven human image animations generated by StableAnimator, showing its power to synthesize <b>high-fidelity</b> and <b>ID-preserving videos</b>. All animations are <b>directly synthesized by StableAnimator without the use of any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span>
 </p>
 
 <p align="center">
- <img src="assets/figures/case-35.gif" width="1024" />
- <img src="assets/figures/case-42.gif" width="1024" />
- <img src="assets/figures/case-18.gif" width="1024" />
- <img src="assets/figures/case-24.gif" width="1024" />
 <br/>
 <span>Comparison results between StableAnimator and state-of-the-art (SOTA) human image animation models highlight the superior performance of StableAnimator in delivering <b>high-fidelity, identity-preserving human image animation</b>.</span>
 </p>
@@ -37,13 +37,16 @@ StableAnimator: High-Quality Identity-Preserving Human Image Animation
 <p align="center">
 <img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/>
 <br/>
- <i>An overview of the framework of StableAnimator.</i>
 </p>
 
 Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, <b>the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses.</b> Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively, and face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.
 
 ## News
- * `[2024-11-26]`:🔥 The data pre-processing codes (human skeleton extraction) are available! Other codes will be released very soon. Stay tuned!
 * `[2024-11-26]`:🔥 The project page, code, technical report and [a basic model checkpoint](https://huggingface.co/FrancisRing/StableAnimator/tree/main) are released. Further training codes, data pre-processing codes, the evaluation dataset and StableAnimator-pro will be released very soon. Stay tuned!
 
 ## To-Do List
@@ -51,7 +54,7 @@ Current diffusion models for human image animation struggle to ensure identity (
 - [x] Inference Code
 - [x] Evaluation Samples
 - [x] Data Pre-Processing Code (Skeleton Extraction)
- - [ ] Data Pre-Processing Code (Human Face Mask Extraction)
 - [ ] Evaluation Dataset
 - [ ] Training Code
 - [ ] StableAnimator-pro
@@ -63,52 +66,61 @@ For the basic version of the model checkpoint, it supports generating videos at
 
 ### Environment setup
 
- Python 3+ with torch 2.x is recommended and has been validated on an Nvidia V100 GPU. We recommend using the docker image [2.1.0-cuda11.8-cudnn8-devel](https://hub.docker.com/layers/pytorch/pytorch/2.1.0-cuda11.8-cudnn8-devel/images/sha256-558b78b9a624969d54af2f13bf03fbad27907dbb6f09973ef4415d6ea24c80d9?context=explore) or [deeptimhe/ubuntu20.04-cuda11.3.1-python3.8-pytorch1.12:orig-sing-pytorch3d0.7.2](https://hub.docker.com/layers/deeptimhe/ubuntu20.04-cuda11.3.1-python3.8-pytorch1.12/orig-sing-pytorch3d0.7.2/images/sha256-023fbbc55df6d9feffc75a3fe2daba31e09ecc39c5dcc39a6cb64e5c6a7f9ca7?context=explore). Follow the commands below to install all the dependencies of StableAnimator:
-
 ```
 pip install -r requirements.txt
- conda install xformers -c xformers -y
- pip install onnxruntime-gpu==1.17.0 --index-url=https://pkgs.dev.azure.com/onnxruntime/onnxruntime/_packaging/onnxruntime-cuda-12/pypi/simple
 ```
 
 ### Download weights
- If you experience connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https://hf-mirror.com`.
 Please download weights manually as follows:
 ```
- cd StableAnimator/
- mkdir checkpoints
 ```
 All the weights should be organized as follows.
 ```
- checkpoints/
 ├── DWPose
- │   ├── dw-ll_ucoco_384.onnx
- │   └── yolox_l.onnx
- ├── Animation
- │   ├── pose_net.pth
- │   ├── face_encoder.pth
- │   └── unet.pth
- ├── SVD
- │   └── stable-video-diffusion-img2vid-xt
- │       ├── feature_extractor
- │       ├── image_encoder
- │       ├── scheduler
- │       ├── unet
- │       ├── vae
- │       ├── model_index.json
- │       ├── svd_xt.safetensors
- │       └── svd_xt_image_decoder.safetensors
-
- ```
- 1. Download DWPose pretrained model: [dwpose](https://huggingface.co/FrancisRing/StableAnimator/tree/main/DWPose)
- 2. Download the pre-trained checkpoint of StableAnimator from [Huggingface](https://huggingface.co/FrancisRing/StableAnimator/tree/main/Animation)
- 3. Download the SVD pretrained model: [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/tree/main)
-
 
 ### Evaluation Samples
- The evaluation samples presented in the paper can be downloaded from [OneDrive](https://1drv.ms/f/c/becb962aad1a1f95/EubdzCAI7BFLhJff2LrHkt8BC9mOiwJ5V67t-ypxRnCK4Q?e=ElEmcn). Please download evaluation samples manually as follows:
 ```
- cd StableAnimator/
 mkdir inference
 ```
 All the evaluation samples should be organized as follows:
@@ -127,7 +139,6 @@ inference/
 │   ├── faces
 │   └── reference.png
 ```
- It is worth noting that the data pre-processing codes (human face mask extraction) will be released very soon. Stay tuned!
 
 ### Human Skeleton Extraction
 We leverage the pre-trained DWPose to extract the human skeletons. In the initialization of DWPose, the pretrained weights should be configured in `/DWPose/dwpose_utils/wholebody.py`:
@@ -148,8 +159,14 @@ ffmpeg -i target.mp4 -q:v 1 path/test/target_images/frame_%d.png
 ```
 The obtained frames are saved in `path/test/target_images`.
 
- ### Model inference
 
 A sample configuration for testing is provided as `command_basic_infer.sh`. You can also easily modify the various configurations according to your needs.
 
 ```
@@ -166,12 +183,16 @@ cd animated_images
 ffmpeg -framerate 20 -i frame_%d.png -c:v libx264 -crf 10 -pix_fmt yuv420p /path/animation.mp4
 ```
 "-framerate" refers to the fps setting. "-crf" indicates the quality of the generated MP4 file, with smaller values corresponding to higher quality.
 
 ### VRAM requirement and Runtime
 
- For the 15s demo video, the 16-frame basic model requires 18GB VRAM and finishes in 12 minutes on a 4090 GPU.
 
- The minimum VRAM requirement for the 16-frame U-Net model is 10GB; however, the VAE decoder demands 16GB. You have the option to run the VAE decoder on CPU.
 
 ## Contact
 If you have any suggestions or find our work helpful, feel free to contact me
 
 ---
 # StableAnimator
 
+ <a href='https://francis-rings.github.io/StableAnimator'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2411.17697'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/StableAnimator/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=7fwFyFDzQgg'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1X5zyYUEuD'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>
 
 StableAnimator: High-Quality Identity-Preserving Human Image Animation
 <br/>
 
 [<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Huya Inc; <sup>4</sup>Carnegie Mellon University]
 
 <p align="center">
+ <img src="assets/figures/case-47.gif" width="256" />
+ <img src="assets/figures/case-61.gif" width="256" />
+ <img src="assets/figures/case-45.gif" width="256" />
+ <img src="assets/figures/case-46.gif" width="256" />
+ <img src="assets/figures/case-5.gif" width="256" />
+ <img src="assets/figures/case-17.gif" width="256" />
 <br/>
 <span>Pose-driven human image animations generated by StableAnimator, showing its power to synthesize <b>high-fidelity</b> and <b>ID-preserving videos</b>. All animations are <b>directly synthesized by StableAnimator without the use of any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span>
 </p>
 
 <p align="center">
+ <img src="assets/figures/case-35.gif" width="384" />
+ <img src="assets/figures/case-42.gif" width="384" />
+ <img src="assets/figures/case-18.gif" width="384" />
+ <img src="assets/figures/case-24.gif" width="384" />
 <br/>
 <span>Comparison results between StableAnimator and state-of-the-art (SOTA) human image animation models highlight the superior performance of StableAnimator in delivering <b>high-fidelity, identity-preserving human image animation</b>.</span>
 </p>
 
 <p align="center">
 <img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/>
 <br/>
+ <i>An overview of the StableAnimator framework.</i>
 </p>
 
 Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, <b>the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses.</b> Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively, and face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.
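
For intuition only, the inference-time idea can be pictured as a denoising loop that nudges each intermediate prediction toward the reference identity. The sketch below is an illustrative caricature of such ID-aware guidance, not the paper's actual HJB-based solver; `denoise_step`, `face_embed`, and `ref_embed` are hypothetical stand-ins.
```python
import torch
import torch.nn.functional as F

def id_guided_denoising(latents, timesteps, denoise_step, face_embed, ref_embed, step_size=0.1):
    """Illustrative sketch only: after each diffusion update, take a small gradient
    step that increases cosine similarity between the predicted face embedding and
    the reference embedding, constraining the denoising path toward the reference ID."""
    for t in timesteps:
        latents = denoise_step(latents, t)                       # ordinary diffusion update
        latents = latents.detach().requires_grad_(True)
        sim = F.cosine_similarity(face_embed(latents), ref_embed, dim=-1).mean()
        grad = torch.autograd.grad(sim, latents)[0]              # direction improving ID similarity
        latents = (latents + step_size * grad).detach()
    return latents
```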
 
 ## News
+ * `[2024-12-10]`:🔥 The Gradio interface is released! Many thanks to [@gluttony-10](https://space.bilibili.com/893892) for his contribution! Other codes will be released very soon. Stay tuned!
+ * `[2024-12-6]`:🔥 All data preprocessing codes (human skeleton extraction and human face mask extraction) are released! The training code and a detailed training tutorial will be released before 2024.12.13. Stay tuned!
+ * `[2024-12-4]`:🔥 We are thrilled to release an interesting dance demo (🔥🔥APT Dance🔥🔥)! The generated video can be seen on [YouTube](https://www.youtube.com/watch?v=KNPoAsWr_sk) and [Bilibili](https://www.bilibili.com/video/BV1KczXYhER7).
+ * `[2024-11-28]`:🔥 The data pre-processing codes (human skeleton extraction) are available! Other codes will be released very soon. Stay tuned!
 * `[2024-11-26]`:🔥 The project page, code, technical report and [a basic model checkpoint](https://huggingface.co/FrancisRing/StableAnimator/tree/main) are released. Further training codes, data pre-processing codes, the evaluation dataset and StableAnimator-pro will be released very soon. Stay tuned!
 
 ## To-Do List
 
 - [x] Inference Code
 - [x] Evaluation Samples
 - [x] Data Pre-Processing Code (Skeleton Extraction)
+ - [x] Data Pre-Processing Code (Human Face Mask Extraction)
 - [ ] Evaluation Dataset
 - [ ] Training Code
 - [ ] StableAnimator-pro
 
 ### Environment setup
 
 ```
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
+ pip install torch==2.5.1+cu124 xformers --index-url https://download.pytorch.org/whl/cu124
 pip install -r requirements.txt
 ```
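
As a quick optional sanity check (not part of the official instructions), you can confirm that the CUDA build of PyTorch and xformers import cleanly after installation:
```python
# Optional sanity check for the environment set up above.
import torch
import xformers

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)
```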
 
 ### Download weights
+ If you encounter connection issues with Hugging Face, you can use the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https://hf-mirror.com`.
 Please download weights manually as follows:
 ```
+ cd StableAnimator
+ git lfs install
+ git clone https://huggingface.co/FrancisRing/StableAnimator checkpoints
 ```
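
If `git lfs` is inconvenient, the same repository can, in principle, also be fetched with the `huggingface_hub` Python API (an optional alternative to the `git clone` above, not part of the official instructions):
```python
from huggingface_hub import snapshot_download

# Download the full StableAnimator weight repository into ./checkpoints
snapshot_download(repo_id="FrancisRing/StableAnimator", local_dir="checkpoints")
```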
 All the weights should be organized as follows.
+ The overall file structure of this project should be organized as follows:
 ```
+ StableAnimator/
 ├── DWPose
+ ├── animation
+ ├── checkpoints
+ │   ├── DWPose
+ │   │   ├── dw-ll_ucoco_384.onnx
+ │   │   └── yolox_l.onnx
+ │   ├── Animation
+ │   │   ├── pose_net.pth
+ │   │   ├── face_encoder.pth
+ │   │   └── unet.pth
+ │   ├── SVD
+ │   │   ├── feature_extractor
+ │   │   ├── image_encoder
+ │   │   ├── scheduler
+ │   │   ├── unet
+ │   │   ├── vae
+ │   │   ├── model_index.json
+ │   │   ├── svd_xt.safetensors
+ │   │   └── svd_xt_image_decoder.safetensors
+ │   └── inference.zip
+ ├── models
+ │   └── antelopev2
+ │       ├── 1k3d68.onnx
+ │       ├── 2d106det.onnx
+ │       ├── genderage.onnx
+ │       ├── glintr100.onnx
+ │       └── scrfd_10g_bnkps.onnx
+ ├── app.py
+ ├── command_basic_infer.sh
+ ├── inference_basic.py
+ ├── requirements.txt
+ ```
 
 ### Evaluation Samples
+ The evaluation samples presented in the paper can be downloaded from [OneDrive](https://1drv.ms/f/c/becb962aad1a1f95/EubdzCAI7BFLhJff2LrHkt8BC9mOiwJ5V67t-ypxRnCK4Q?e=ElEmcn) or from `inference.zip` in `checkpoints`. Please download the evaluation samples manually as follows:
 ```
+ cd StableAnimator
 mkdir inference
 ```
 All the evaluation samples should be organized as follows:
 
 │   ├── faces
 │   └── reference.png
 ```
 
 ### Human Skeleton Extraction
 We leverage the pre-trained DWPose to extract the human skeletons. In the initialization of DWPose, the pretrained weights should be configured in `/DWPose/dwpose_utils/wholebody.py`:
 
 ```
 The obtained frames are saved in `path/test/target_images`.
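
If ffmpeg is unavailable, an equivalent frame dump can be sketched with OpenCV (assuming `opencv-python` is installed; the ffmpeg command remains the documented route):
```python
import os
import cv2

def video_to_frames(video_path="target.mp4", out_dir="path/test/target_images"):
    """Write every frame as frame_1.png, frame_2.png, ... (1-indexed, matching ffmpeg's %d naming)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{idx}.png"), frame)
        idx += 1
    cap.release()

video_to_frames()
```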
 
+ ### Human Face Mask Extraction
+ Given the path to an image folder containing multiple RGB `.png` files, you can run the following command to extract the corresponding human face masks:
+ ```
+ python face_mask_extraction.py --image_folder="path/StableAnimator/inference/your_case/target_images"
+ ```
+ `path/StableAnimator/inference/your_case/target_images` contains multiple `.png` files. The obtained masks are saved in `path/StableAnimator/inference/your_case/faces`.
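
`face_mask_extraction.py` is the authoritative implementation. Purely for intuition, a rough face-region mask can be derived from detector bounding boxes with InsightFace and the `antelopev2` models listed in the file tree above; the sketch below is illustrative and may differ from what the script actually does:
```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# Illustrative only: box-shaped face masks from detections, not the project's exact method.
app = FaceAnalysis(name="antelopev2", root=".")  # expects ./models/antelopev2/*.onnx
app.prepare(ctx_id=0, det_size=(640, 640))

def face_box_mask(image_path, out_path):
    img = cv2.imread(image_path)
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    for face in app.get(img):
        x1, y1, x2, y2 = face.bbox.astype(int)
        mask[max(y1, 0):y2, max(x1, 0):x2] = 255  # white face region on black background
    cv2.imwrite(out_path, mask)
```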
 
+ ### Model inference
 A sample configuration for testing is provided as `command_basic_infer.sh`. You can also easily modify the various configurations according to your needs.
 
 ```
 
 ffmpeg -framerate 20 -i frame_%d.png -c:v libx264 -crf 10 -pix_fmt yuv420p /path/animation.mp4
 ```
 "-framerate" refers to the fps setting. "-crf" indicates the quality of the generated MP4 file, with smaller values corresponding to higher quality.
+ Additionally, you can run the following command to launch a Gradio interface:
+ ```
+ python app.py
+ ```
 
 ### VRAM requirement and Runtime
 
+ For the 15s demo video (512x512, fps=30), the 16-frame basic model requires 8GB VRAM and finishes in 5 minutes on a 4090 GPU.
 
+ The minimum VRAM requirement for the 16-frame U-Net of the pro model is 10GB (576x1024, fps=30); however, the VAE decoder demands 16GB. You have the option to run the VAE decoder on CPU.
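
How CPU offloading of the decoder is exposed depends on the inference script; as a general illustration only (hypothetical `vae` and `latents` names, not a documented flag), it amounts to keeping the decoder on the CPU and moving the latents over before decoding:
```python
import torch

def decode_on_cpu(vae, latents, chunk_size=2):
    """Illustrative sketch: decode latents with a CPU-resident VAE decoder in small
    chunks to keep peak memory low. `vae` and `latents` are hypothetical stand-ins."""
    vae = vae.to("cpu")
    latents = latents.to("cpu")
    frames = []
    with torch.no_grad():
        for i in range(0, latents.shape[0], chunk_size):
            out = vae.decode(latents[i:i + chunk_size])
            frames.append(out.sample if hasattr(out, "sample") else out)  # diffusers VAEs wrap the tensor
    return torch.cat(frames, dim=0)
```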
 
 ## Contact
 If you have any suggestions or find our work helpful, feel free to contact me