<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Memory and speed

We present some techniques and ideas to optimize 🤗 Diffusers *inference* for memory or speed.
As a general rule, we recommend the use of [xFormers](https://github.com/facebookresearch/xformers) for memory-efficient attention, so please see the recommended [installation instructions](xformers).

We'll discuss how the following settings impact performance and memory.
| | Latency | Speed-up |
| ---------------- | ------- | ------- |
| original | 9.50s | x1 |
| cuDNN auto-tuner | 9.37s | x1.01 |
| fp16 | 3.61s | x2.63 |
| channels last memory format | 3.30s | x2.88 |
| traced UNet | 3.21s | x2.96 |
| memory-efficient attention | 2.63s | x3.61 |
<em>
Obtained on an NVIDIA TITAN RTX by generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps.
</em>
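As a rough illustration of how such latency numbers can be collected (this is not the script used for the table above, and `benchmark` is a hypothetical helper name), average wall-clock latency over several runs can be measured like this:

```python
import time


def benchmark(fn, warmup=2, runs=5):
    """Return the average wall-clock latency of fn() in seconds.

    For GPU pipelines you would additionally call torch.cuda.synchronize()
    around the timed region, since CUDA kernels launch asynchronously.
    """
    for _ in range(warmup):  # warmup runs are excluded from the timing
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# example: time a cheap stand-in workload instead of a real pipeline call
latency = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{latency * 1000:.2f} ms per call")
```

In practice you would pass `lambda: pipe(prompt)` as the workload and keep the warmup runs, since the first pipeline calls include one-time setup costs.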
## Enable cuDNN auto-tuner

[NVIDIA cuDNN](https://developer.nvidia.com/cudnn) supports many algorithms to compute a convolution. The autotuner runs a short benchmark and selects the kernel with the best performance on the given hardware for a given input size.

Since we're using **convolutional networks** (other types are currently not supported), we can enable the cuDNN autotuner before launching the inference with the following setting:
```python
import torch
torch.backends.cudnn.benchmark = True
```
### Use tf32 instead of fp32 (on Ampere and later CUDA devices)

On Ampere and later CUDA devices, matrix multiplications and convolutions can use the TensorFloat-32 (TF32) mode for faster but slightly less accurate computations.
By default, PyTorch enables TF32 mode for convolutions but not for matrix multiplications.
Unless your network requires full float32 precision, we recommend enabling this setting for matrix multiplications, too.
It can significantly speed up computations with typically negligible loss of numerical accuracy.
You can read more about it [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance#tf32).
All you need to do is add the following before your inference:
```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
```
## ๋ฐ˜์ •๋ฐ€๋„ ๊ฐ€์ค‘์น˜
๋” ๋งŽ์€ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ ˆ์•ฝํ•˜๊ณ  ๋” ๋น ๋ฅธ ์†๋„๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๋ฅผ ๋ฐ˜์ •๋ฐ€๋„(half precision)๋กœ ์ง์ ‘ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์—ฌ๊ธฐ์—๋Š” `fp16`์ด๋ผ๋Š” ๋ธŒ๋žœ์น˜์— ์ €์žฅ๋œ float16 ๋ฒ„์ „์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถˆ๋Ÿฌ์˜ค๊ณ , ๊ทธ ๋•Œ `float16` ์œ ํ˜•์„ ์‚ฌ์šฉํ•˜๋„๋ก PyTorch์— ์ง€์‹œํ•˜๋Š” ์ž‘์—…์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```
<Tip warning={true}>

It is strongly discouraged to use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines, as it can produce black images and is always slower than pure float16 precision.

</Tip>
## Sliced attention for additional memory savings

For even more memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.
<Tip>

Attention slicing is useful even with a batch size of just 1, as long as the model uses more than one attention head.
If there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.

</Tip>
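To see why slicing over heads saves memory without changing the result, here is a minimal CPU-only sketch (not the actual Diffusers implementation) comparing batched attention with a per-head loop:

```python
import torch

torch.manual_seed(0)
heads, seq, dim = 8, 16, 40
q = torch.randn(heads, seq, dim)
k = torch.randn(heads, seq, dim)
v = torch.randn(heads, seq, dim)

# all heads at once: materializes a full (heads, seq, seq) attention matrix
attn = torch.softmax(q @ k.transpose(-1, -2) / dim**0.5, dim=-1)
out_full = attn @ v

# one head at a time: only a (seq, seq) matrix is live at each step
out_sliced = torch.stack(
    [torch.softmax(q[h] @ k[h].T / dim**0.5, dim=-1) @ v[h] for h in range(heads)]
)

# the outputs are numerically equivalent; only peak memory differs
assert torch.allclose(out_full, out_sliced, atol=1e-5)
```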
๊ฐ ํ—ค๋“œ์— ๋Œ€ํ•ด ์ˆœ์ฐจ์ ์œผ๋กœ ์–ดํ…์…˜ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ ค๋ฉด, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถ”๋ก  ์ „์— ํŒŒ์ดํ”„๋ผ์ธ์—์„œ [`~StableDiffusionPipeline.enable_attention_slicing`]๋ฅผ ํ˜ธ์ถœํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```
์ถ”๋ก  ์‹œ๊ฐ„์ด ์•ฝ 10% ๋Š๋ ค์ง€๋Š” ์•ฝ๊ฐ„์˜ ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ์žˆ์ง€๋งŒ ์ด ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด 3.2GB ์ •๋„์˜ ์ž‘์€ VRAM์œผ๋กœ๋„ Stable Diffusion์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!
## Sliced VAE decode for larger batches

To decode large batches of images with limited VRAM, or to enable batches of 32 images or more, you can use sliced VAE decode, which decodes the batch latents one image at a time.

You will likely want to couple this with [`~StableDiffusionPipeline.enable_attention_slicing`] or [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.

To perform the VAE decode one image at a time, invoke [`~StableDiffusionPipeline.enable_vae_slicing`] in your pipeline before inference. For example:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_vae_slicing()
images = pipe([prompt] * 32).images
```
๋‹ค์ค‘ ์ด๋ฏธ์ง€ ๋ฐฐ์น˜์—์„œ VAE ๋””์ฝ”๋“œ๊ฐ€ ์•ฝ๊ฐ„์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ๋‹จ์ผ ์ด๋ฏธ์ง€ ๋ฐฐ์น˜์—์„œ๋Š” ์„ฑ๋Šฅ ์˜ํ–ฅ์€ ์—†์Šต๋‹ˆ๋‹ค.
<a name="sequential_offloading"></a>
## Offloading to CPU with accelerate for memory savings

For additional memory savings, you can offload the weights to CPU and load them to GPU only when performing the forward pass.

To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
```
And you can get the memory consumption down to under 3GB.

Note that this method works at the submodule level, not at the whole-model level. It is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different submodules of the UNet are sequentially onloaded and then offloaded as needed, so the number of memory transfers is large.
<Tip>

Consider using <a href="#model_offloading">model offloading</a>, another optimization method, which is much faster although the memory savings are not as large.

</Tip>
It can also be combined with attention slicing for minimal memory consumption (< 2GB):
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)
image = pipe(prompt).images[0]
```
**Note**: When using `enable_sequential_cpu_offload()`, it is important **not** to move the pipeline to CUDA beforehand, or else the gain in memory consumption will only be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information.
<a name="model_offloading"></a>
## Model offloading for fast inference and memory savings

[Sequential CPU offloading](#sequential_offloading), as discussed in the previous section, preserves a lot of memory but makes inference slower, because submodules are moved to the GPU as needed and are immediately returned to the CPU when a new module runs.

Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent _modules_. This results in a negligible impact on inference time (compared with moving the pipeline to `cuda`), while still providing some memory savings.

In this scenario, only one of the main components of the pipeline (typically the text encoder, unet, and vae) will be on the GPU while the others wait on the CPU.
Components like the UNet that run for multiple iterations will stay on the GPU until they are no longer needed.

This feature can be enabled by invoking `enable_model_cpu_offload()` on the pipeline, as shown below:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
image = pipe(prompt).images[0]
```
์ด๋Š” ์ถ”๊ฐ€์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ˆ์•ฝ์„ ์œ„ํ•œ attention slicing๊ณผ๋„ ํ˜ธํ™˜๋ฉ๋‹ˆ๋‹ค.
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)
image = pipe(prompt).images[0]
```
<Tip>

This feature requires `accelerate` version 0.17.0 or higher.

</Tip>
## Using Channels Last memory format

Channels last memory format is an alternative way of ordering NCHW tensors in memory that preserves the dimension ordering.
Channels last tensors are ordered in such a way that the channels become the densest dimension (aka storing images pixel-by-pixel).
Since not all operators currently support the channels last format, using it may result in worse performance, so it's better to try it and see if it works well for your model.

For example, to set the UNet model in our pipeline to use the channels last format, we can use the following:
```python
print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
# (2880, 1, 960, 320): having a stride of 1 for the 2nd dimension proves that it works
print(pipe.unet.conv_out.state_dict()["weight"].stride())
```
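The same stride change can be observed on any 4D tensor without loading a pipeline; a CPU-only sketch (the tensor shape here is illustrative, not taken from the UNet):

```python
import torch

x = torch.randn(2, 320, 64, 64)  # NCHW tensor, default contiguous layout
print(x.stride())  # (1310720, 4096, 64, 1): W is the densest dimension

y = x.to(memory_format=torch.channels_last)
print(y.stride())  # (1310720, 1, 20480, 320): stride 1 on the channel dimension

# values are unchanged; only the physical memory layout differs
assert torch.equal(x, y)
assert y.is_contiguous(memory_format=torch.channels_last)
```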
## Tracing

Tracing runs an example input tensor through the model and captures the operations that are invoked as that input makes its way through the model's layers, so that an executable or `ScriptFunction` is returned, which is then optimized using just-in-time compilation.

To trace our UNet model, we can use the following:
```python
import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# disable torch gradients
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50


# load inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
unet = pipe.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # set return_dict=False as default

# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")

# warmup and optimize the graph
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save("unet_traced.pt")
```
๊ทธ ๋‹ค์Œ, ํŒŒ์ดํ”„๋ผ์ธ์˜ `unet` ํŠน์„ฑ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถ”์ ๋œ ๋ชจ๋ธ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
```python
from diffusers import StableDiffusionPipeline
import torch
from dataclasses import dataclass


@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# use the jitted unet
unet_traced = torch.jit.load("unet_traced.pt")


# replace pipe.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipe.unet = TracedUNet()

prompt = "a photo of an astronaut riding a horse on mars"
with torch.inference_mode():
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]
```
## Memory-efficient attention
์–ดํ…์…˜ ๋ธ”๋ก์˜ ๋Œ€์—ญํญ์„ ์ตœ์ ํ™”ํ•˜๋Š” ์ตœ๊ทผ ์ž‘์—…์œผ๋กœ GPU ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋˜๊ณ  ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
@tridao์˜ ๊ฐ€์žฅ ์ตœ๊ทผ์˜ ํ”Œ๋ž˜์‹œ ์–ดํ…์…˜: [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf).
๋ฐฐ์น˜ ํฌ๊ธฐ 1(ํ”„๋กฌํ”„ํŠธ 1๊ฐœ)์˜ 512x512 ํฌ๊ธฐ๋กœ ์ถ”๋ก ์„ ์‹คํ–‰ํ•  ๋•Œ ๋ช‡ ๊ฐ€์ง€ Nvidia GPU์—์„œ ์–ป์€ ์†๋„ ํ–ฅ์ƒ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:
| GPU | Base Attention FP16 | Memory-Efficient Attention FP16 |
|------------------ |--------------------- |--------------------------------- |
| NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
| NVIDIA 3060 RTX | 4.6it/s | 7.8it/s |
| NVIDIA A10G | 8.88it/s | 15.6it/s |
| NVIDIA RTX A6000 | 11.7it/s | 21.09it/s |
| NVIDIA TITAN RTX | 12.51it/s | 18.22it/s |
| A100-SXM4-40GB | 18.6it/s | 29.it/s |
| A100-SXM-80GB | 18.7it/s | 29.5it/s |
To leverage it, make sure you have:
- PyTorch > 1.12
- CUDA available
- [Installed the xformers library](xformers)
```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

with torch.inference_mode():
    sample = pipe("a small cat")

# optional: you can disable it via
# pipe.disable_xformers_memory_efficient_attention()
```