benjamin-paine/champ

Diffusers · Safetensors · CHAMPPipeline

benjamin-paine committed on Mar 28, 2024

Commit babeecd (verified) · 1 parent: 03e71d4

Update README.md

Files changed (1)

README.md +32 -240

@@ -1,35 +1,39 @@
 ---
 license: apache-2.0
 ---
-This repository contains a pruned and partially reorganized version of [AniPortrait](https://fudan-generative-vision.github.io/champ/#/).
 ```
-@misc{wei2024aniportrait,
-      title={AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations},
-      author={Huawei Wei and Zejun Yang and Zhisheng Wang},
       year={2024},
-      eprint={2403.17694},
       archivePrefix={arXiv},
       primaryClass={cs.CV}
 }
 ```
-# Usage
-## Installation
-First, install the AniPortrait package into your python environment. If you're creating a new environment for AniPortrait, be sure you also specify the version of torch you want with CUDA support, or else this will try to run only on CPU.
 ```sh
-pip install git+https://github.com/painebenjamin/aniportrait.git
 ```
 Now, you can create the pipeline, automatically pulling the weights from this repository, either as individual models:
 ```py
-from aniportrait import AniPortraitPipeline
-pipeline = AniPortraitPipeline.from_pretrained(
-  "benjamin-paine/aniportrait",
   torch_dtype=torch.float16,
   variant="fp16",
   device="cuda"
@@ -39,242 +43,30 @@ pipeline = AniPortraitPipeline.from_pretrained(
 Or, as a single file:
 ```py
-from aniportrait import AniPortraitPipeline
-pipeline = AniPortraitPipeline.from_single_file(
-  "benjamin-paine/aniportrait",
   torch_dtype=torch.float16,
   variant="fp16",
   device="cuda"
 ).to("cuda", dtype=torch.float16)
 ```
-The `AniPortraitPipeline` is a mega pipeline, capable of instantiating and executing other pipelines. It provides the following functions:
-## Workflows
-### img2img
-```py
-pipeline.img2img(
-    reference_image: PIL.Image.Image,
-    pose_reference_image: PIL.Image.Image,
-    num_inference_steps: int,
-    guidance_scale: float,
-    eta: float=0.0,
-    reference_pose_image: Optional[Image.Image]=None,
-    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
-    output_type: Optional[str]="pil",
-    return_dict: bool=True,
-    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
-    callback_steps: Optional[int]=None,
-    width: Optional[int]=None,
-    height: Optional[int]=None,
-    **kwargs: Any
-) -> Pose2VideoPipelineOutput
-```
-Using a reference image (for structure) and a pose reference image (for pose), render an image of the former in the pose of the latter.
-- The pose reference image here is an unprocessed image, from which the face pose will be extracted.
-- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
-### vid2vid
-```py
-pipeline.vid2vid(
-    reference_image: PIL.Image.Image,
-    pose_reference_images: List[PIL.Image.Image],
-    num_inference_steps: int,
-    guidance_scale: float,
-    eta: float=0.0,
-    reference_pose_image: Optional[Image.Image]=None,
-    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
-    output_type: Optional[str]="pil",
-    return_dict: bool=True,
-    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
-    callback_steps: Optional[int]=None,
-    width: Optional[int]=None,
-    height: Optional[int]=None,
-    video_length: Optional[int]=None,
-    context_schedule: str="uniform",
-    context_frames: int=16,
-    context_overlap: int=4,
-    context_batch_size: int=1,
-    interpolation_factor: int=1,
-    use_long_video: bool=True,
-    **kwargs: Any
-) -> Pose2VideoPipelineOutput
-```
-Using a reference image (for structure) and a sequence of pose reference images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation when the poses are longer than 16 frames.
-- Optionally pass `use_long_video = false` to disable using the long video pipeline.
-- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
-- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images.
-### audio2vid
-```py
-pipeline.audio2vid(
-    audio: str,
-    reference_image: PIL.Image.Image,
-    num_inference_steps: int,
-    guidance_scale: float,
-    fps: int=30,
-    eta: float=0.0,
-    reference_pose_image: Optional[Image.Image]=None,
-    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
-    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
-    output_type: Optional[str]="pil",
-    return_dict: bool=True,
-    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
-    callback_steps: Optional[int]=None,
-    width: Optional[int]=None,
-    height: Optional[int]=None,
-    video_length: Optional[int]=None,
-    context_schedule: str="uniform",
-    context_frames: int=16,
-    context_overlap: int=4,
-    context_batch_size: int=1,
-    interpolation_factor: int=1,
-    use_long_video: bool=True,
-    **kwargs: Any
-) -> Pose2VideoPipelineOutput
-```
-Using an audio file, draw `fps` face pose images per second for the duration of the audio. Then, using those face pose images, render a video.
-- Optionally include a list of images to extract the poses from prior to merging with audio-generated poses (in essence, pass a video here to control non-speech motion). The default is a moderately active loop of head movement.
-- Optionally pass width/height to modify the size. Defaults to reference image size.
-- Optionally pass `use_long_video = false` to disable using the long video pipeline.
-- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
-- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images.
-## Internals/Helpers
-### img2pose
-```py
-pipeline.img2pose(
-	reference_image: PIL.Image.Image,
-	width: Optional[int]=None,
-	height: Optional[int]=None
-) -> PIL.Image.Image
-```
-Detects face landmarks in an image and draws a face pose image.
-- Optionally modify the original width and height.
-### vid2pose
-```py
-pipeline.vid2pose(
-	reference_image: PIL.Image.Image,
-    retarget_image: Optional[PIL.Image.Image],
-	width: Optional[int]=None,
-	height: Optional[int]=None
-) -> List[PIL.Image.Image]
-```
-Detects face landmarks in a series of images and draws pose images.
-- Optionally modify the original width and height.
-- Optionally retarget to a different face position, useful for video-to-video tasks.
-### audio2pose
 ```py
-pipeline.audio2pose(
-    audio_path: str,
-    fps: int=30,
-    reference_image: Optional[PIL.Image.Image]=None,
-    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
-    width: Optional[int]=None,
-	height: Optional[int]=None
-) -> List[PIL.Image.Image]
 ```
-Using an audio file, draw `fps` face pose images per second for the duration of the audio.
-- Optionally include a reference image to extract the face shape and initial position from. Default has a generic androgynous face shape.
-- Optionally include a list of images to extract the poses from prior to merging with audio-generated poses (in essence, pass a video here to control non-speech motion). The default is a moderately active loop of head movement.
-- Optionally pass width/height to modify the size. Defaults to reference image size, then pose image sizes, then 256.
-### pose2img
-```py
-pipeline.pose2img(
-    reference_image: PIL.Image.Image,
-    pose_image: PIL.Image.Image,
-    num_inference_steps: int,
-    guidance_scale: float,
-    eta: float=0.0,
-    reference_pose_image: Optional[Image.Image]=None,
-    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
-    output_type: Optional[str]="pil",
-    return_dict: bool=True,
-    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
-    callback_steps: Optional[int]=None,
-    width: Optional[int]=None,
-    height: Optional[int]=None,
-    **kwargs: Any
-) -> Pose2VideoPipelineOutput
-```
-Using a reference image (for structure) and a pose image (for pose), render an image of the former in the pose of the latter.
-- The pose image here is a processed face pose. To pass a non-processed face pose, see `img2img`.
-- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
-### pose2vid
-```py
-pipeline.pose2vid(
-    reference_image: PIL.Image.Image,
-    pose_images: List[PIL.Image.Image],
-    num_inference_steps: int,
-    guidance_scale: float,
-    eta: float=0.0,
-    reference_pose_image: Optional[Image.Image]=None,
-    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
-    output_type: Optional[str]="pil",
-    return_dict: bool=True,
-    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
-    callback_steps: Optional[int]=None,
-    width: Optional[int]=None,
-    height: Optional[int]=None,
-    video_length: Optional[int]=None,
-    **kwargs: Any
-) -> Pose2VideoPipelineOutput
-```
-Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter.
-- The pose images here are a processed face poses. To non-processed face poses, see `vid2vid`.
-- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
-- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images.
-### pose2vid_long
-```py
-pipeline.pose2vid_long(
-    reference_image: PIL.Image.Image,
-    pose_images: List[PIL.Image.Image],
-    num_inference_steps: int,
-    guidance_scale: float,
-    eta: float=0.0,
-    reference_pose_image: Optional[Image.Image]=None,
-    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
-    output_type: Optional[str]="pil",
-    return_dict: bool=True,
-    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
-    callback_steps: Optional[int]=None,
-    width: Optional[int]=None,
-    height: Optional[int]=None,
-    video_length: Optional[int]=None,
-    context_schedule: str="uniform",
-    context_frames: int=16,
-    context_overlap: int=4,
-    context_batch_size: int=1,
-    interpolation_factor: int=1,
-    **kwargs: Any
-) -> Pose2VideoPipelineOutput
-```
-Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation.
-- The pose images here are a processed face poses. To non-processed face poses, see `vid2vid`.
-- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
-- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images.

 ---
 license: apache-2.0
 ---
+This repository contains a pruned and partially reorganized version of [CHAMP](https://fudan-generative-vision.github.io/champ/#/).
 ```
+@misc{zhu2024champ,
+      title={Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance},
+      author={Shenhao Zhu and Junming Leo Chen and Zuozhuo Dai and Yinghui Xu and Xun Cao and Yao Yao and Hao Zhu and Siyu Zhu},
       year={2024},
+      eprint={2403.14781},
       archivePrefix={arXiv},
       primaryClass={cs.CV}
 }
 ```
+<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64429aaf7feb866811b12f73/wZku1I_4L4VwWeXXKgXqb.mp4"></video>
+Video credit: [Polina Tankilevitch, Pexels](https://www.pexels.com/video/a-young-woman-dancing-hip-hop-3873100/)
+Image credit: [Andrea Piacquadio, Pexels](https://www.pexels.com/photo/man-in-black-jacket-wearing-black-headphones-3831645/)
+# Usage
+First, install the CHAMP package into your Python environment. If you are creating a new environment for CHAMP, also install a build of torch with CUDA support, or else the pipeline will try to run only on CPU.
 ```sh
+pip install git+https://github.com/painebenjamin/champ.git
 ```
 Now, you can create the pipeline, automatically pulling the weights from this repository, either as individual models:
 ```py
+from champ import CHAMPPipeline
+pipeline = CHAMPPipeline.from_pretrained(
+  "benjamin-paine/champ",
   torch_dtype=torch.float16,
   variant="fp16",
   device="cuda"
 Or, as a single file:
 ```py
+from champ import CHAMPPipeline
+pipeline = CHAMPPipeline.from_single_file(
+  "benjamin-paine/champ",
   torch_dtype=torch.float16,
   variant="fp16",
   device="cuda"
 ).to("cuda", dtype=torch.float16)
 ```
+Follow this format for execution:
 ```py
+result = pipeline(
+  reference: PIL.Image.Image,
+  guidance: Dict[str, List[PIL.Image.Image]],
+  width: int,
+  height: int,
+  video_length: int,
+  num_inference_steps: int,
+  guidance_scale: float
+).videos
+# Result is a list of PIL Images
 ```
+Starting values for `num_inference_steps` and `guidance_scale` are `20` and `3.5`, respectively.
+Guidance keys include `depth`, `normal`, `dwpose`, and `semantic_map` (densepose). This guide does not explain how to obtain those samples, but examples are available in [the git repository](https://github.com/painebenjamin/champ/tree/master/example).
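
The new README gives the call signature but not a concrete invocation, so here is a minimal sketch of how the `guidance` dictionary might be assembled and passed in. Several things are assumptions rather than anything the commit states: the folder layout (one subfolder per guidance key, with zero-padded frame names so that lexicographic order matches frame order), the helper names `collect_guidance_paths`, `load_guidance`, and `animate`, the requirement that every guidance key has the same number of frames, and the placeholder `512` width and height. Only the `CHAMPPipeline` arguments and the keyword names are taken from the README above.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# Guidance keys named in the README.
GUIDANCE_KEYS = ("depth", "normal", "dwpose", "semantic_map")
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_guidance_paths(
    root, keys: Sequence[str] = GUIDANCE_KEYS
) -> Dict[str, List[Path]]:
    """Collect sorted frame paths per guidance key.

    Assumes a hypothetical layout of <root>/<key>/<frame>.png; keys whose
    folder is missing are skipped.
    """
    root = Path(root)
    paths: Dict[str, List[Path]] = {}
    for key in keys:
        folder = root / key
        if folder.is_dir():
            paths[key] = sorted(
                p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    if not paths:
        raise FileNotFoundError(f"no guidance folders found under {root}")
    # Assumption: every guidance sequence should cover the same frames.
    lengths = {key: len(frames) for key, frames in paths.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"guidance sequences differ in length: {lengths}")
    return paths


def load_guidance(root, keys: Sequence[str] = GUIDANCE_KEYS):
    """Load the collected paths as RGB PIL images."""
    from PIL import Image  # imported lazily so path collection works without Pillow

    return {
        key: [Image.open(p).convert("RGB") for p in frames]
        for key, frames in collect_guidance_paths(root, keys).items()
    }


def animate(reference_path, guidance_root, width: int = 512, height: int = 512):
    """Run the pipeline using the README's suggested starting values."""
    import torch
    from PIL import Image
    from champ import CHAMPPipeline

    guidance = load_guidance(guidance_root)
    # Assumption: render one output frame per guidance frame.
    video_length = len(next(iter(guidance.values())))
    pipeline = CHAMPPipeline.from_pretrained(
        "benjamin-paine/champ",
        torch_dtype=torch.float16,
        variant="fp16",
        device="cuda",
    ).to("cuda", dtype=torch.float16)
    return pipeline(
        reference=Image.open(reference_path).convert("RGB"),
        guidance=guidance,
        width=width,
        height=height,
        video_length=video_length,
        num_inference_steps=20,
        guidance_scale=3.5,
    ).videos
```

Calling `animate("reference.png", "guidance/")` with hypothetical paths would then return the list of PIL images the README describes. The heavy imports sit inside the functions so that the path collection can be checked on a machine without a GPU or the `champ` package installed.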