---
license: other
language:
- en
pipeline_tag: text-to-image
tags:
- stable-diffusion
- alimama-creative
library_name: diffusers
---

# SD3 ControlNet Inpainting

![SD3](images/sd3_compressed.png)

<center><i>a woman wearing a white jacket, black hat and black pants is standing in a field, the hat writes SD3</i></center>

![bucket_alibaba](images/bucket_ali_compressed.png)

<center><i>a person wearing a white shoe, carrying a white bucket with text "alibaba" on it</i></center>

This is a ControlNet inpainting model fine-tuned from SD3-medium. It offers several advantages:

* Leveraging the SD3 16-channel VAE and high-resolution generation capability at 1024x1024, the model effectively preserves the integrity of non-inpainting regions, including text.

* It is capable of generating text through inpainting.

* It demonstrates superior aesthetic performance in portrait generation.

Comparison with [SDXL-Inpainting](https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1):

From left to right: Input image, Masked image, SDXL inpainting, Ours.

![0](images/0_compressed.png)
<center><i>a tiger sitting on a park bench</i></center>

![1](images/0r_compressed.png)
<center><i>a dog sitting on a park bench</i></center>

![2](images/1_compressed.png)
<center><i>a young woman wearing a blue and pink floral dress</i></center>

![3](images/3_compressed.png)
<center><i>a woman wearing a white jacket, black hat and black pants is standing in a field, the hat writes SD3</i></center>

![4](images/5_compressed.png)
<center><i>an air conditioner hanging on the bedroom wall</i></center>

# Using with Diffusers

Step 1: Upgrade to the latest version of diffusers (>=0.29.2): `pip install -U diffusers`.

Step 2: Download the two required Python files (`pipeline_sd3_controlnet_inpainting.py` and `controlnet_sd3.py`) from either this repo or [GitHub](https://github.com/JPlin/SD3-Controlnet-Inpainting).
(We plan to merge this feature into the official Diffusers library.)
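If you prefer to fetch the two files programmatically, the following is a minimal sketch using `hf_hub_download` from the `huggingface_hub` package. It assumes both files sit at the root of this repo under exactly the names given above; the helper name `fetch_pipeline_files` is hypothetical and not part of this repo.

```python
from pathlib import Path

REPO_ID = "alimama-creative/SD3-Controlnet-Inpainting"
# Assumption: both files are stored at the repo root under these names.
REQUIRED_FILES = ["pipeline_sd3_controlnet_inpainting.py", "controlnet_sd3.py"]


def fetch_pipeline_files(target_dir="."):
    """Download the two helper modules into target_dir and return their paths."""
    # Imported lazily so this module still loads without huggingface_hub.
    from huggingface_hub import hf_hub_download

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target)
        for name in REQUIRED_FILES
    ]
```

Calling `fetch_pipeline_files(".")` from the directory that holds your script should make the imports in the next step resolvable.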

Step 3: Run `demo.py`, or use the following code:

``` python
from diffusers.utils import load_image, check_min_version
import torch

from pipeline_sd3_controlnet_inpainting import StableDiffusion3ControlNetInpaintingPipeline, one_image_and_mask
from controlnet_sd3 import SD3ControlNetModel

check_min_version("0.29.2")

# Build model
controlnet = SD3ControlNetModel.from_pretrained(
    "alimama-creative/SD3-Controlnet-Inpainting",
    use_safetensors=True,
)
pipe = StableDiffusion3ControlNetInpaintingPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.text_encoder.to(torch.float16)
pipe.controlnet.to(torch.float16)
pipe.to("cuda")

# Load image
image = load_image(
    "https://huggingface.co/alimama-creative/SD3-Controlnet-Inpainting/resolve/main/images/prod.png"
)
mask = load_image(
    "https://huggingface.co/alimama-creative/SD3-Controlnet-Inpainting/resolve/main/images/mask.jpeg"
)

# Set args
width = 1024
height = 1024
prompt = "a woman wearing a white jacket, black hat and black pants is standing in a field, the hat writes SD3"
generator = torch.Generator(device="cuda").manual_seed(48)
input_dict = one_image_and_mask(image, mask, size=(width, height), latent_scale=pipe.vae_scale_factor, invert_mask=True)

# Inference
res_image = pipe(
    negative_prompt='deformed, distorted, disfigured, poorly drawn, bad anatomy, wrong anatomy, extra limb, missing limb, floating limbs, mutated hands and fingers, disconnected limbs, mutation, mutated, ugly, disgusting, blurry, amputation, NSFW',
    prompt=prompt,
    height=height,
    width=width,
    control_image=input_dict["pil_masked_image"],  # H, W, C
    control_mask=input_dict["mask"] > 0.5,  # B,1,H,W
    num_inference_steps=28,
    generator=generator,
    controlnet_conditioning_scale=0.95,
    guidance_scale=7,
).images[0]

res_image.save("res.png")
```


## Training Detail

The model was trained on 12M images from LAION-2B and internal sources for 20k steps at a resolution of 1024x1024.

* Mixed precision: FP16
* Learning rate: 1e-4
* Batch size: 192
* Timestep sampling mode: 'logit_normal'
* Loss: Flow Matching
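
To illustrate the last two settings, here is a minimal sketch of logit-normal timestep sampling combined with a flow-matching target. It is not the training code for this model. It assumes the rectified-flow convention `x_t = (1 - t) * x0 + t * noise` with velocity target `noise - x0`, and a logit-normal distribution with mean 0 and standard deviation 1; neither the convention nor these parameters is stated on this card. Plain Python lists stand in for latent tensors so the example runs without dependencies.

```python
import math
import random


def sample_logit_normal_t(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def flow_matching_pair(x0, noise, t):
    """Return the interpolated input x_t and the velocity target."""
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]
    target = [e - a for a, e in zip(x0, noise)]
    return x_t, target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                       # stand-in for clean latents
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise of the same shape
t = sample_logit_normal_t(rng)
x_t, target = flow_matching_pair(x0, noise, t)
# A real training step would compute mse(model(x_t, t), target).
```

Because the sigmoid concentrates samples around `t = 0.5`, this sampling mode spends more training steps on intermediate noise levels than uniform sampling would.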

## Limitation

Because training used only 1024x1024 resolution, inference performs best at this size; other sizes yield suboptimal results. We plan to run multi-resolution training in the future and will open-source the new weights once it is complete.
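
One possible workaround for inputs of other sizes is to resize and center-crop them to 1024x1024 before inference. The sketch below only computes the geometry; the helper name `center_crop_box` is hypothetical, and center cropping is an assumption that may not suit every image, since it discards content at the edges.

```python
def center_crop_box(width, height, target=1024):
    """Return (resized_size, crop_box) that maps an image to target x target.

    The image is scaled so its shorter side equals `target`, then the
    longer side is cropped symmetrically around the center.
    """
    scale = target / min(width, height)
    # max() guards against rounding the shorter side below the target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)
```

The two returned tuples can be passed to Pillow's `Image.resize` and `Image.crop`. Apply the same resize and crop to the mask so that it stays aligned with the image.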

## License
The model is fine-tuned from SD3; therefore, its license follows the original [SD3 license](https://huggingface.co/stabilityai/stable-diffusion-3-medium#license).