---
license: apache-2.0
pipeline_tag: image-to-video
tags:
- human-animation
---
# StableAnimator

<a href='https://francis-rings.github.io/StableAnimator'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2411.17697'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/StableAnimator/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=7fwFyFDzQgg'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1X5zyYUEuD'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>

StableAnimator: High-Quality Identity-Preserving Human Image Animation
<br/>
*Shuyuan Tu<sup>1</sup>, Zhen Xing<sup>1</sup>, Xintong Han<sup>3</sup>, Zhi-Qi Cheng<sup>4</sup>, Qi Dai<sup>2</sup>, Chong Luo<sup>2</sup>, Zuxuan Wu<sup>1</sup>*
<br/>
[<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Huya Inc; <sup>4</sup>Carnegie Mellon University]

<p align="center">
  <img src="assets/figures/case-47.gif" width="256" />
  <img src="assets/figures/case-61.gif" width="256" />
  <img src="assets/figures/case-45.gif" width="256" />
  <img src="assets/figures/case-46.gif" width="256" />
  <img src="assets/figures/case-5.gif" width="256" />
  <img src="assets/figures/case-17.gif" width="256" />
  <br/>
  <span>Pose-driven human image animations generated by StableAnimator, demonstrating its ability to synthesize <b>high-fidelity</b> and <b>ID-preserving videos</b>. All animations are <b>directly synthesized by StableAnimator without any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span>
</p>

<p align="center">
  <img src="assets/figures/case-35.gif" width="384" />
  <img src="assets/figures/case-42.gif" width="384" />
  <img src="assets/figures/case-18.gif" width="384" />
  <img src="assets/figures/case-24.gif" width="384" />
  <br/>
  <span>Comparison results between StableAnimator and state-of-the-art (SOTA) human image animation models highlight the superior performance of StableAnimator in delivering <b>high-fidelity, identity-preserving human image animation</b>.</span>
</p>


## Overview

<p align="center">
  <img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/>
  </br>
  <i>Overview of the StableAnimator framework.</i>
</p>

Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, <b>the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses.</b> Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference that strive for identity consistency. In particular, StableAnimator first computes image and face embeddings with off-the-shelf extractors; the face embeddings are further refined by interacting with the image embeddings through a global content-aware Face Encoder. StableAnimator then introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.

## News
* `[2024-12-10]`:πŸ”₯ The Gradio interface is released! Many thanks to [@gluttony-10](https://space.bilibili.com/893892) for his contribution! The remaining code will be released very soon. Stay tuned!
* `[2024-12-6]`:πŸ”₯ All data pre-processing code (human skeleton extraction and human face mask extraction) is released! The training code and a detailed training tutorial will be released before 2024.12.13. Stay tuned!
* `[2024-12-4]`:πŸ”₯ We are thrilled to release an interesting dance demo (πŸ”₯πŸ”₯APT DanceπŸ”₯πŸ”₯)! The generated video can be seen on [YouTube](https://www.youtube.com/watch?v=KNPoAsWr_sk) and [Bilibili](https://www.bilibili.com/video/BV1KczXYhER7).
* `[2024-11-28]`:πŸ”₯ The data pre-processing code for human skeleton extraction is available! The remaining code will be released very soon. Stay tuned!
* `[2024-11-26]`:πŸ”₯ The project page, code, technical report, and [a basic model checkpoint](https://huggingface.co/FrancisRing/StableAnimator/tree/main) are released. The training code, data pre-processing code, evaluation dataset, and StableAnimator-pro will be released very soon. Stay tuned!

## To-Do List
- [x] StableAnimator-basic 
- [x] Inference Code
- [x] Evaluation Samples
- [x] Data Pre-Processing Code (Skeleton Extraction)
- [x] Data Pre-Processing Code (Human Face Mask Extraction)
- [ ] Evaluation Dataset
- [ ] Training Code
- [ ] StableAnimator-pro
- [ ] Inference Code with HJB-based Face Optimization

## Quickstart

The basic model checkpoint supports generating videos at 576x1024 or 512x512 resolution. If you run into insufficient GPU memory, you can reduce the number of animated frames accordingly.

### Environment setup

```
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install torch==2.5.1+cu124 xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
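
As an optional sanity check (not part of the official setup), you can confirm that PyTorch sees your CUDA device and that xformers imports cleanly:
```
# Optional: expect the CUDA build of torch 2.5.1 and "True" from the first command.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import xformers; print(xformers.__version__)"
```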

### Download weights
If you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https://hf-mirror.com`.
Please download weights manually as follows:
```
cd StableAnimator
git lfs install
git clone https://huggingface.co/FrancisRing/StableAnimator checkpoints
```
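If `git lfs` is not available in your environment, the same repository can alternatively be fetched with the `huggingface-cli` tool from `huggingface_hub` (an optional alternative, not the official instruction; it also respects the `HF_ENDPOINT` mirror variable mentioned above):
```
# Optional alternative to git lfs cloning.
pip install -U "huggingface_hub[cli]"
huggingface-cli download FrancisRing/StableAnimator --local-dir checkpoints
```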
The downloaded weights and the overall file structure of this project should be organized as follows:
```
StableAnimator/
β”œβ”€β”€ DWPose
β”œβ”€β”€ animation
β”œβ”€β”€ checkpoints
β”‚Β Β  β”œβ”€β”€ DWPose
β”‚Β Β  β”‚Β   β”œβ”€β”€ dw-ll_ucoco_384.onnx
β”‚Β Β  β”‚Β Β  └── yolox_l.onnx
β”‚Β Β  β”œβ”€β”€ Animation
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ pose_net.pth
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ face_encoder.pth
β”‚Β Β  β”‚Β Β  └── unet.pth
β”‚Β Β  β”œβ”€β”€ SVD
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ feature_extractor
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ image_encoder
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ scheduler
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ unet
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ vae
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ model_index.json
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ svd_xt.safetensors
β”‚Β Β  β”‚Β Β  └── svd_xt_image_decoder.safetensors
β”‚Β Β  └── inference.zip
β”œβ”€β”€ models
β”‚   └── antelopev2
β”‚       β”œβ”€β”€ 1k3d68.onnx
β”‚       β”œβ”€β”€ 2d106det.onnx
β”‚       β”œβ”€β”€ genderage.onnx
β”‚       β”œβ”€β”€ glintr100.onnx
β”‚       └── scrfd_10g_bnkps.onnx
β”œβ”€β”€ app.py
β”œβ”€β”€ command_basic_infer.sh
β”œβ”€β”€ inference_basic.py
└── requirements.txt
```
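
As an optional sanity check, assuming the layout above, you can confirm that the key weight files are where the inference scripts expect them:
```
# Each path below should exist after downloading; adjust if you placed the weights elsewhere.
ls checkpoints/DWPose/yolox_l.onnx checkpoints/DWPose/dw-ll_ucoco_384.onnx
ls checkpoints/Animation/pose_net.pth checkpoints/Animation/face_encoder.pth checkpoints/Animation/unet.pth
ls checkpoints/SVD/svd_xt.safetensors
ls models/antelopev2/scrfd_10g_bnkps.onnx
```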

### Evaluation Samples
The evaluation samples presented in the paper can be downloaded from [OneDrive](https://1drv.ms/f/c/becb962aad1a1f95/EubdzCAI7BFLhJff2LrHkt8BC9mOiwJ5V67t-ypxRnCK4Q?e=ElEmcn) or taken from `inference.zip` in `checkpoints`. Please download and organize the evaluation samples manually as follows:
```
cd StableAnimator
mkdir inference
```
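If you use the `inference.zip` shipped in `checkpoints`, you can unpack it into this folder. This sketch assumes the archive contains the `case-*` directories at its top level; check the result and adjust if the layout differs:
```
# Unpack the bundled evaluation samples into the inference folder (assumed archive layout).
unzip checkpoints/inference.zip -d inference
```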
All the evaluation samples should be organized as follows:
```
inference/
β”œβ”€β”€ case-1
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”œβ”€β”€ faces
β”‚Β Β  └── reference.png
β”œβ”€β”€ case-2
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”œβ”€β”€ faces
β”‚Β Β  └── reference.png
β”œβ”€β”€ case-3
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”œβ”€β”€ faces
β”‚Β Β  └── reference.png
```

### Human Skeleton Extraction
We leverage the pre-trained DWPose to extract human skeletons. When initializing DWPose, the paths to its pretrained weights should be configured in `/DWPose/dwpose_utils/wholebody.py`:
```
onnx_det = 'path/checkpoints/DWPose/yolox_l.onnx'
onnx_pose = 'path/checkpoints/DWPose/dw-ll_ucoco_384.onnx'
```
Given the target image folder containing multiple .png files, you can use the following command to obtain the corresponding human skeleton images:
```
python skeleton_extraction.py --target_image_folder_path="path/test/target_images" --ref_image_path="path/test/reference.png" --poses_folder_path="path/test/poses"
```
It is worth noting that the .png files in the target image folder should be named in the format `frame_i.png`, such as `frame_0.png`, `frame_1.png`, and so on.
`--ref_image_path` is the path of the given reference image. The resulting human skeleton images are saved in `path/test/poses`. Note that the target skeleton images should be aligned with the reference image in terms of body shape.

If you only have the target MP4 file (target.mp4), we recommend using `ffmpeg` to convert it into individual frames (.png files) without quality loss.
```
ffmpeg -i target.mp4 -q:v 1 path/test/target_images/frame_%d.png
```
The obtained frames are saved in `path/test/target_images`.

### Human Face Mask Extraction
Given the path to an image folder containing multiple RGB `.png` files, you can run the following command to extract the corresponding human face masks:
```
python face_mask_extraction.py --image_folder="path/StableAnimator/inference/your_case/target_images"
```
`path/StableAnimator/inference/your_case/target_images` contains multiple `.png` files. The obtained masks are saved in `path/StableAnimator/inference/your_case/faces`.

### Model inference
A sample configuration for testing is provided in `command_basic_infer.sh`. You can modify its settings to suit your needs.

```
bash command_basic_infer.sh
```
StableAnimator supports human image animation at two resolution settings: 512x512 and 576x1024. The key options in `command_basic_infer.sh` are:
* "--width" and "--height" set the resolution of the animation.
* "--output_dir" is the path where the generated animation is saved.
* "--validation_control_folder" and "--validation_image" are the paths of the given pose sequence and the reference image, respectively.
* "--pretrained_model_name_or_path" is the path of the pretrained SVD weights.
* "--posenet_model_name_or_path", "--face_encoder_model_name_or_path", and "--unet_model_name_or_path" are the paths of the pretrained StableAnimator weights.

If you have enough GPU resources, you can increase "--decode_chunk_size" (4 => 8 => 16) in `command_basic_infer.sh` to improve the temporal smoothness of the animation. A hypothetical expansion of the wrapped command is sketched below.
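
For orientation only, the sketch below shows what the command wrapped by `command_basic_infer.sh` might look like, assembled from the flags described above and the checkpoint layout shown earlier. The argument values are placeholder assumptions; treat the shipped `command_basic_infer.sh` as the authoritative reference.
```
# Hypothetical invocation; all values are placeholders based on the flag descriptions above.
python inference_basic.py \
  --pretrained_model_name_or_path="checkpoints/SVD" \
  --posenet_model_name_or_path="checkpoints/Animation/pose_net.pth" \
  --face_encoder_model_name_or_path="checkpoints/Animation/face_encoder.pth" \
  --unet_model_name_or_path="checkpoints/Animation/unet.pth" \
  --validation_image="inference/case-1/reference.png" \
  --validation_control_folder="inference/case-1/poses" \
  --output_dir="animation_outputs" \
  --width=512 \
  --height=512 \
  --decode_chunk_size=4
```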

Tips: if your GPU memory is limited, you can reduce the number of animated frames. This command generates two outputs: a frame folder <b>animated_images</b> and <b>animated_images.gif</b>.
If you want a high-quality MP4 file, we recommend running ffmpeg on the frames in <b>animated_images</b> as follows:
```
cd animated_images
ffmpeg -framerate 20 -i frame_%d.png -c:v libx264 -crf 10 -pix_fmt yuv420p /path/animation.mp4
```
"-framerate" refers to the fps setting. "-crf" indicates the quality of the generated MP4 file, with smaller values corresponding to higher quality.
You can also run the following command to launch a Gradio interface:
```
python app.py
```

### VRAM requirement and Runtime

For the 15s demo video (512x512, fps=30), the 16-frame basic model requires 8GB VRAM and finishes in 5 minutes on a 4090 GPU.

The minimum VRAM requirement for the 16-frame U-Net of the pro model is 10GB (576x1024, fps=30); however, the VAE decoder demands 16GB. You can optionally run the VAE decoder on the CPU.

## Contact
If you have any suggestions or find our work helpful, feel free to contact me:

Email: francisshuyuan@gmail.com

If you find our work useful, <b>please consider starring this GitHub repository and citing it</b>:
```bib
@article{tu2024stableanimator,
  title={StableAnimator: High-Quality Identity-Preserving Human Image Animation},
  author={Shuyuan Tu and Zhen Xing and Xintong Han and Zhi-Qi Cheng and Qi Dai and Chong Luo and Zuxuan Wu},
  journal={arXiv preprint arXiv:2411.17697},
  year={2024}
}
```