<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# InstructPix2Pix

[InstructPix2Pix](https://arxiv.org/abs/2211.09800) is a method for fine-tuning text-conditioned diffusion models so that they can follow an edit instruction for an input image. Models fine-tuned with this method take the following as inputs:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png" alt="instructpix2pix-inputs" width=600/>
</p>

The output is an "edited" image that reflects the edit instruction applied to the input image:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
</p>

The `train_instruct_pix2pix.py` script (you can find it [here](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py)) implements the training procedure and shows how to apply it to Stable Diffusion.


*** `train_instruct_pix2pix.py` implements the InstructPix2Pix training procedure while staying faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix), but it has only been tested on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can affect the final results. For better results, we recommend training for longer on a larger dataset. A large dataset for InstructPix2Pix training can be found [here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered).
***

## Running locally with PyTorch

### Installing the dependencies

Before running the script, make sure to install the library's training dependencies:

**Important**

To successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the installation up to date, since we update the example scripts frequently and install example-specific requirements. To do this, execute the following steps in a new virtual environment:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```

Then `cd` into the example folder:
```bash
cd examples/instruct_pix2pix
```

Now run:
```bash
pip install -r requirements.txt
```

And initialize an [πŸ€—Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```

ν˜Ήμ€ ν™˜κ²½μ— λŒ€ν•œ 질문 없이 기본적인 accelerate ꡬ성을 μ‚¬μš©ν•˜λ €λ©΄ λ‹€μŒμ„ μ‹€ν–‰ν•˜μ„Έμš”.

```bash
accelerate config default
```

ν˜Ήμ€ μ‚¬μš© 쀑인 ν™˜κ²½μ΄ notebookκ³Ό 같은 λŒ€ν™”ν˜• μ‰˜μ€ μ§€μ›ν•˜μ§€ μ•ŠλŠ” κ²½μš°λŠ” λ‹€μŒ 절차λ₯Ό λ”°λΌμ£Όμ„Έμš”.

```python
from accelerate.utils import write_basic_config

write_basic_config()
```

### Example

As mentioned before, we'll use a [small dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. This dataset is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
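Each training example pairs an original image and its edited counterpart with the instruction that maps one to the other. A minimal sketch of one example's shape is below; the column names follow the example dataset above and are an assumption to verify against your own dataset (the script exposes flags to override them):

```python
# Sketch of one training example as the data loader sees it.
# Column names follow fusing/instructpix2pix-1000-samples; treat them
# as an assumption and check them against the dataset you actually use.
example = {
    "input_image": "original.png",              # image to be edited
    "edit_prompt": "make the mountains snowy",  # the edit instruction
    "edited_image": "original_edited.png",      # the edit target
}

required_columns = {"input_image", "edit_prompt", "edited_image"}
missing = sorted(required_columns - example.keys())
print(missing)  # an empty list means the example is well-formed
```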

Specify the `MODEL_NAME` environment variable (either a Hub model repository ID or a path to a folder containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to specify the dataset name in `DATASET_ID`:


```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
```

Now, we can launch training. The script saves all the components (`feature_extractor`, `scheduler`, `text_encoder`, `unet`, etc.) to a subfolder in your repository.

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42 \
    --push_to_hub
```
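The `--conditioning_dropout_prob=0.05` flag controls the conditioning dropout from the paper: with small probability, the edit prompt, the input image, or both are replaced by null conditionings during training, so the model can later be sampled with the two guidance scales. A simplified per-example sketch of that masking logic follows (the actual script applies it batched over tensors, so take this as an illustration rather than the exact implementation):

```python
def conditioning_dropout(p, r):
    """Decide which conditionings to drop for one training example,
    given a uniform draw r in [0, 1) and dropout probability p.
    Simplified sketch of the batched masking in train_instruct_pix2pix.py:
      r in [0, p)   -> drop the edit prompt only
      r in [p, 2p)  -> drop both the prompt and the input image
      r in [2p, 3p) -> drop the input image only
      otherwise     -> keep both conditionings
    """
    drop_prompt = r < 2 * p
    drop_image = p <= r < 3 * p
    return drop_prompt, drop_image

# With p = 0.05, each partial/unconditional case covers ~5% of examples.
print(conditioning_dropout(0.05, 0.01))  # (True, False): prompt dropped
print(conditioning_dropout(0.05, 0.07))  # (True, True): both dropped
print(conditioning_dropout(0.05, 0.12))  # (False, True): image dropped
print(conditioning_dropout(0.05, 0.50))  # (False, False): both kept
```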


μΆ”κ°€μ μœΌλ‘œ, κ°€μ€‘μΉ˜μ™€ λ°”μ΄μ–΄μŠ€λ₯Ό ν•™μŠ΅ 과정에 λͺ¨λ‹ˆν„°λ§ν•˜μ—¬ 검증 좔둠을 μˆ˜ν–‰ν•˜λŠ” 것을 μ§€μ›ν•©λ‹ˆλ‹€. `report_to="wandb"`와 이 κΈ°λŠ₯을 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
    --validation_prompt="make the mountains snowy" \
    --seed=42 \
    --report_to=wandb \
    --push_to_hub
```

λͺ¨λΈ 디버깅에 μœ μš©ν•œ 이 평가 방법 ꢌμž₯ν•©λ‹ˆλ‹€. 이λ₯Ό μ‚¬μš©ν•˜κΈ° μœ„ν•΄ `wandb`λ₯Ό μ„€μΉ˜ν•˜λŠ” 것을 μ£Όλͺ©ν•΄μ£Όμ„Έμš”. `pip install wandb`둜 μ‹€ν–‰ν•΄ `wandb`λ₯Ό μ„€μΉ˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

[Here](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq), you can find an example run that includes some validation samples and the training hyperparameters.

***Note: In the original paper, the authors observed that models trained at a resolution of 256x256 generalize surprisingly well to larger resolutions such as 512x512. This is likely because of the larger dataset they used during training.***

 ## λ‹€μˆ˜μ˜ GPU둜 ν•™μŠ΅ν•˜κΈ°

`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch) for running distributed training with `accelerate`. Here is an example command:


```bash 
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \
 --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
 --dataset_name=sayakpaul/instructpix2pix-1000-samples \
 --use_ema \
 --enable_xformers_memory_efficient_attention \
 --resolution=512 --random_flip \
 --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
 --max_train_steps=15000 \
 --checkpointing_steps=5000 --checkpoints_total_limit=1 \
 --learning_rate=5e-05 --lr_warmup_steps=0 \
 --conditioning_dropout_prob=0.05 \
 --mixed_precision=fp16 \
 --seed=42 \
 --push_to_hub
```

## Inference

Once training is complete, we can perform inference:

```python
import PIL
import requests
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

model_id = "your_model_id"  # <- replace this with your model ID
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"


def download_image(url):
    image = PIL.Image.open(requests.get(url, stream=True).raw)
    image = PIL.ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image


image = download_image(url)
prompt = "wipe out the lake"
num_inference_steps = 20
image_guidance_scale = 1.5
guidance_scale = 10

edited_image = pipe(
    prompt,
    image=image,
    num_inference_steps=num_inference_steps,
    image_guidance_scale=image_guidance_scale,
    guidance_scale=guidance_scale,
    generator=generator,
).images[0]
edited_image.save("edited_image.png")
```

An example model repository obtained with this training script can be found here: [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix).

We encourage you to experiment with the following three parameters to trade off speed and quality during inference:

* `num_inference_steps`
* `image_guidance_scale`
* `guidance_scale`

In particular, `image_guidance_scale` and `guidance_scale` can have a significant impact on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).
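These two scales enter the paper's dual classifier-free guidance formula, which combines three noise predictions per step: fully unconditional, conditioned on the input image only, and conditioned on both the image and the prompt. A scalar sketch of that combination is below (real pipelines apply it to the model's noise-prediction tensors):

```python
def combine_noise_preds(uncond, image_cond, full_cond,
                        image_guidance_scale, guidance_scale):
    """Dual classifier-free guidance from the InstructPix2Pix paper:
    image_guidance_scale steers toward faithfulness to the input image,
    guidance_scale toward following the edit instruction. Plain floats
    stand in for the noise-prediction tensors here."""
    return (
        uncond
        + image_guidance_scale * (image_cond - uncond)
        + guidance_scale * (full_cond - image_cond)
    )

# A larger guidance_scale amplifies the prompt direction (full_cond - image_cond):
print(combine_noise_preds(0.0, 1.0, 2.0, 1.5, 10.0))  # 0 + 1.5*1 + 10*1 = 11.5
```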


If you're looking for some interesting ways to use the InstructPix2Pix training methodology, take a look at this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd).