<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Memory and speed

λ©”λͺ¨λ¦¬ λ˜λŠ” 속도에 λŒ€ν•΄ πŸ€— Diffusers *μΆ”λ‘ *을 μ΅œμ ν™”ν•˜κΈ° μœ„ν•œ λͺ‡ 가지 기술과 아이디어λ₯Ό μ œμ‹œν•©λ‹ˆλ‹€.
일반적으둜, memory-efficient attention을 μœ„ν•΄ [xFormers](https://github.com/facebookresearch/xformers) μ‚¬μš©μ„ μΆ”μ²œν•˜κΈ° λ•Œλ¬Έμ—, μΆ”μ²œν•˜λŠ” [μ„€μΉ˜ 방법](xformers)을 보고 μ„€μΉ˜ν•΄ λ³΄μ„Έμš”.

λ‹€μŒ 섀정이 μ„±λŠ₯κ³Ό λ©”λͺ¨λ¦¬μ— λ―ΈμΉ˜λŠ” 영ν–₯에 λŒ€ν•΄ μ„€λͺ…ν•©λ‹ˆλ‹€.

|                  | Latency | Speed-up |
| ---------------- | ------- | ------- |
| original         | 9.50s   | x1      |
| cuDNN auto-tuner | 9.37s   | x1.01   |
| fp16             | 3.61s   | x2.63   |
| channels last memory format | 3.30s | x2.88 |
| traced UNet      | 3.21s   | x2.96   |
| memory-efficient attention | 2.63s | x3.61 |

<em>
   NVIDIA TITAN RTXμ—μ„œ 50 DDIM μŠ€ν…μ˜ "a photo of an astronaut riding a horse on mars" ν”„λ‘¬ν”„νŠΈλ‘œ 512x512 크기의 단일 이미지λ₯Ό μƒμ„±ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
</em>
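
If you want to time a configuration yourself, a minimal timing sketch along the following lines can be used. This is not the exact script behind the table above; it assumes a CUDA GPU and the `runwayml/stable-diffusion-v1-5` checkpoint used throughout this guide.

```python
import time

import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # use the DDIM scheduler, as in the table above
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

# warmup run so one-time costs (CUDA context creation, kernel selection) don't skew the timing
_ = pipe(prompt, num_inference_steps=50)

torch.cuda.synchronize()
start = time.time()
image = pipe(prompt, num_inference_steps=50).images[0]
torch.cuda.synchronize()
print(f"latency: {time.time() - start:.2f}s")
```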

## Enable cuDNN auto-tuner

[NVIDIA cuDNN](https://developer.nvidia.com/cudnn) supports many algorithms to compute a convolution. The autotuner runs a short benchmark and selects the kernel with the best performance on a given hardware for a given input size.

**μ»¨λ³Όλ£¨μ…˜ λ„€νŠΈμ›Œν¬**λ₯Ό ν™œμš©ν•˜κ³  있기 λ•Œλ¬Έμ— (λ‹€λ₯Έ μœ ν˜•λ“€μ€ ν˜„μž¬ μ§€μ›λ˜μ§€ μ•ŠμŒ), λ‹€μŒ 섀정을 톡해 μΆ”λ‘  전에 cuDNN autotunerλ₯Ό ν™œμ„±ν™”ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```python
import torch

torch.backends.cudnn.benchmark = True
```

### Use tf32 instead of fp32 (on Ampere and later CUDA devices)

On Ampere and later CUDA devices, matrix multiplications and convolutions can use the TensorFloat32 (TF32) mode, which is faster but slightly less accurate.
By default, PyTorch enables TF32 mode for convolutions but not for matrix multiplications.
Unless your network requires full float32 precision, we recommend enabling this setting for matrix multiplications as well.
It can significantly speed up computations with typically negligible loss of numerical accuracy.
You can read more about it [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance#tf32).
All you need to do is add the following before your inference:

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True
```
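
Convolutions already use TF32 by default on these devices. If you want to inspect or set that behaviour explicitly as well, the corresponding flag lives under `torch.backends.cudnn` (shown here purely as an illustration):

```python
import torch

print(torch.backends.cudnn.allow_tf32)  # True by default
torch.backends.cudnn.allow_tf32 = True  # explicit opt-in; usually unnecessary
```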

## λ°˜μ •λ°€λ„ κ°€μ€‘μΉ˜

더 λ§Žμ€ GPU λ©”λͺ¨λ¦¬λ₯Ό μ ˆμ•½ν•˜κ³  더 λΉ λ₯Έ 속도λ₯Ό μ–»κΈ° μœ„ν•΄ λͺ¨λΈ κ°€μ€‘μΉ˜λ₯Ό λ°˜μ •λ°€λ„(half precision)둜 직접 뢈러였고 μ‹€ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
μ—¬κΈ°μ—λŠ” `fp16`μ΄λΌλŠ” λΈŒλžœμΉ˜μ— μ €μž₯된 float16 λ²„μ „μ˜ κ°€μ€‘μΉ˜λ₯Ό 뢈러였고, κ·Έ λ•Œ `float16` μœ ν˜•μ„ μ‚¬μš©ν•˜λ„λ‘ PyTorch에 μ§€μ‹œν•˜λŠ” μž‘μ—…μ΄ ν¬ν•¨λ©λ‹ˆλ‹€.

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```

<Tip warning={true}>
  μ–΄λ–€ νŒŒμ΄ν”„λΌμΈμ—μ„œλ„ [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) λ₯Ό μ‚¬μš©ν•˜λŠ” 것은 검은색 이미지λ₯Ό 생성할 수 있고, μˆœμˆ˜ν•œ float16 정밀도λ₯Ό μ‚¬μš©ν•˜λŠ” 것보닀 항상 느리기 λ•Œλ¬Έμ— μ‚¬μš©ν•˜μ§€ μ•ŠλŠ” 것이 μ’‹μŠ΅λ‹ˆλ‹€.
</Tip>

## Sliced attention for additional memory savings

μΆ”κ°€ λ©”λͺ¨λ¦¬ μ ˆμ•½μ„ μœ„ν•΄, ν•œ λ²ˆμ— λͺ¨λ‘ κ³„μ‚°ν•˜λŠ” λŒ€μ‹  λ‹¨κ³„μ μœΌλ‘œ 계산을 μˆ˜ν–‰ν•˜λŠ” 슬라이슀 λ²„μ „μ˜ μ–΄ν…μ…˜(attention)을 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

<Tip>
  Attention slicing is useful even with a batch size of just 1, as long as the model uses more than one attention head.
  If there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.
</Tip>

각 ν—€λ“œμ— λŒ€ν•΄ 순차적으둜 μ–΄ν…μ…˜ 계산을 μˆ˜ν–‰ν•˜λ €λ©΄, λ‹€μŒκ³Ό 같이 μΆ”λ‘  전에 νŒŒμ΄ν”„λΌμΈμ—μ„œ [`~StableDiffusionPipeline.enable_attention_slicing`]λ₯Ό ν˜ΈμΆœν•˜λ©΄ λ©λ‹ˆλ‹€:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```

μΆ”λ‘  μ‹œκ°„μ΄ μ•½ 10% λŠλ €μ§€λŠ” μ•½κ°„μ˜ μ„±λŠ₯ μ €ν•˜κ°€ μžˆμ§€λ§Œ 이 방법을 μ‚¬μš©ν•˜λ©΄ 3.2GB μ •λ„μ˜ μž‘μ€ VRAMμœΌλ‘œλ„ Stable Diffusion을 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€!


## Sliced VAE decode for larger batches

μ œν•œλœ VRAMμ—μ„œ λŒ€κ·œλͺ¨ 이미지 배치λ₯Ό λ””μ½”λ”©ν•˜κ±°λ‚˜ 32개 μ΄μƒμ˜ 이미지가 ν¬ν•¨λœ 배치λ₯Ό ν™œμ„±ν™”ν•˜κΈ° μœ„ν•΄, 배치의 latent 이미지λ₯Ό ν•œ λ²ˆμ— ν•˜λ‚˜μ”© λ””μ½”λ”©ν•˜λŠ” 슬라이슀 VAE λ””μ½”λ“œλ₯Ό μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

You likely want to couple this with [`~StableDiffusionPipeline.enable_attention_slicing`] or [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.

To perform the VAE decode one image at a time, invoke [`~StableDiffusionPipeline.enable_vae_slicing`] in your pipeline before inference. For example:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_vae_slicing()
images = pipe([prompt] * 32).images
```

닀쀑 이미지 λ°°μΉ˜μ—μ„œ VAE λ””μ½”λ“œκ°€ μ•½κ°„μ˜ μ„±λŠ₯ ν–₯상이 μ΄λ£¨μ–΄μ§‘λ‹ˆλ‹€. 단일 이미지 λ°°μΉ˜μ—μ„œλŠ” μ„±λŠ₯ 영ν–₯은 μ—†μŠ΅λ‹ˆλ‹€.


<a name="sequential_offloading"></a>
## λ©”λͺ¨λ¦¬ μ ˆμ•½μ„ μœ„ν•΄ 가속 κΈ°λŠ₯을 μ‚¬μš©ν•˜μ—¬ CPU둜 μ˜€ν”„λ‘œλ”©

μΆ”κ°€ λ©”λͺ¨λ¦¬ μ ˆμ•½μ„ μœ„ν•΄ κ°€μ€‘μΉ˜λ₯Ό CPU둜 μ˜€ν”„λ‘œλ“œν•˜κ³  순방ν–₯ 전달을 μˆ˜ν–‰ν•  λ•Œλ§Œ GPU둜 λ‘œλ“œν•  수 μžˆμŠ΅λ‹ˆλ‹€.

CPU μ˜€ν”„λ‘œλ”©μ„ μˆ˜ν–‰ν•˜λ €λ©΄ [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]λ₯Ό ν˜ΈμΆœν•˜κΈ°λ§Œ ν•˜λ©΄ λ©λ‹ˆλ‹€:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
```

This can bring memory consumption down to less than 3 GB.

Note that this method works at the submodule level, not on whole models. It is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different submodules of the UNet are sequentially onloaded and then offloaded as needed, resulting in a large number of memory transfers.

<Tip>
또 λ‹€λ₯Έ μ΅œμ ν™” 방법인 <a href="#model_offloading">λͺ¨λΈ μ˜€ν”„λ‘œλ”©</a>을 μ‚¬μš©ν•˜λŠ” 것을 κ³ λ €ν•˜μ‹­μ‹œμ˜€. μ΄λŠ” 훨씬 λΉ λ₯΄μ§€λ§Œ λ©”λͺ¨λ¦¬ μ ˆμ•½μ΄ ν¬μ§€λŠ” μ•ŠμŠ΅λ‹ˆλ‹€.
</Tip>

λ˜ν•œ ttention slicingκ³Ό μ—°κ²°ν•΄μ„œ μ΅œμ†Œ λ©”λͺ¨λ¦¬(< 2GB)λ‘œλ„ λ™μž‘ν•  수 μžˆμŠ΅λ‹ˆλ‹€.


```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)

image = pipe(prompt).images[0]
```

**Note**: When using `enable_sequential_cpu_offload()`, it is important to **not** move the pipeline to CUDA beforehand, or else the gain in memory consumption will only be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information.

<a name="model_offloading"></a>
## Model offloading for fast inference and memory savings

[Sequential CPU offloading](#sequential_offloading), as discussed in the previous section, preserves a lot of memory but makes inference slower, because submodules are moved to the GPU as needed and are immediately returned to the CPU when a new module runs.

전체 λͺ¨λΈ μ˜€ν”„λ‘œλ”©μ€ 각 λͺ¨λΈμ˜ ꡬ성 μš”μ†ŒμΈ _modules_을 μ²˜λ¦¬ν•˜λŠ” λŒ€μ‹ , 전체 λͺ¨λΈμ„ GPU둜 μ΄λ™ν•˜λŠ” λŒ€μ•ˆμž…λ‹ˆλ‹€. 이둜 인해 μΆ”λ‘  μ‹œκ°„μ— λ―ΈμΉ˜λŠ” 영ν–₯은 λ―Έλ―Έν•˜μ§€λ§Œ(νŒŒμ΄ν”„λΌμΈμ„ 'cuda'둜 μ΄λ™ν•˜λŠ” 것과 λΉ„κ΅ν•˜μ—¬) μ—¬μ „νžˆ μ•½κ°„μ˜ λ©”λͺ¨λ¦¬λ₯Ό μ ˆμ•½ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

이 μ‹œλ‚˜λ¦¬μ˜€μ—μ„œλŠ” νŒŒμ΄ν”„λΌμΈμ˜ μ£Όμš” ꡬ성 μš”μ†Œ 쀑 ν•˜λ‚˜λ§Œ(일반적으둜 ν…μŠ€νŠΈ 인코더, unet 및 vae) GPU에 있고, λ‚˜λ¨Έμ§€λŠ” CPUμ—μ„œ λŒ€κΈ°ν•  κ²ƒμž…λ‹ˆλ‹€.
μ—¬λŸ¬ λ°˜λ³΅μ„ μœ„ν•΄ μ‹€ν–‰λ˜λŠ” UNetκ³Ό 같은 ꡬ성 μš”μ†ŒλŠ” 더 이상 ν•„μš”ν•˜μ§€ μ•Šμ„ λ•ŒκΉŒμ§€ GPU에 남아 μžˆμŠ΅λ‹ˆλ‹€.

This feature can be enabled by invoking `enable_model_cpu_offload()` on the pipeline, as shown below.

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
image = pipe(prompt).images[0]
```

μ΄λŠ” 좔가적인 λ©”λͺ¨λ¦¬ μ ˆμ•½μ„ μœ„ν•œ attention slicing과도 ν˜Έν™˜λ©λ‹ˆλ‹€.

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)

image = pipe(prompt).images[0]
```

<Tip>
This feature requires `accelerate` version 0.17.0 or larger.
</Tip>
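
If you are unsure which version is installed, a quick check (just an illustrative snippet) is:

```python
import accelerate

print(accelerate.__version__)  # needs to be >= 0.17.0 for enable_model_cpu_offload()
```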

## Using Channels Last memory format

Channels Last λ©”λͺ¨λ¦¬ ν˜•μ‹μ€ 차원 μˆœμ„œλ₯Ό λ³΄μ‘΄ν•˜λŠ” λ©”λͺ¨λ¦¬μ—μ„œ NCHW ν…μ„œ 배열을 λŒ€μ²΄ν•˜λŠ” λ°©λ²•μž…λ‹ˆλ‹€.
Channels Last ν…μ„œλŠ” 채널이 κ°€μž₯ μ‘°λ°€ν•œ 차원이 λ˜λŠ” λ°©μ‹μœΌλ‘œ μ •λ ¬λ©λ‹ˆλ‹€(일λͺ… ν”½μ…€λ‹Ή 이미지λ₯Ό μ €μž₯).
ν˜„μž¬ λͺ¨λ“  μ—°μ‚°μž Channels Last ν˜•μ‹μ„ μ§€μ›ν•˜λŠ” 것은 μ•„λ‹ˆλΌ μ„±λŠ₯이 μ €ν•˜λ  수 μžˆμœΌλ―€λ‘œ, μ‚¬μš©ν•΄λ³΄κ³  λͺ¨λΈμ— 잘 μž‘λ™ν•˜λŠ”μ§€ ν™•μΈν•˜λŠ” 것이 μ’‹μŠ΅λ‹ˆλ‹€.


예λ₯Ό λ“€μ–΄ νŒŒμ΄ν”„λΌμΈμ˜ UNet λͺ¨λΈμ΄ channels Last ν˜•μ‹μ„ μ‚¬μš©ν•˜λ„λ‘ μ„€μ •ν•˜λ €λ©΄ λ‹€μŒμ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```python
print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
# having a stride of 1 for the 2nd dimension, i.e. (2880, 1, 960, 320), proves that it worked
print(pipe.unet.conv_out.state_dict()["weight"].stride())
```

## Tracing

좔적은 λͺ¨λΈμ„ 톡해 예제 μž…λ ₯ ν…μ„œλ₯Ό 톡해 μ‹€ν–‰λ˜λŠ”λ°, ν•΄λ‹Ή μž…λ ₯이 λͺ¨λΈμ˜ λ ˆμ΄μ–΄λ₯Ό 톡과할 λ•Œ ν˜ΈμΆœλ˜λŠ” μž‘μ—…μ„ μΊ‘μ²˜ν•˜μ—¬ μ‹€ν–‰ 파일 λ˜λŠ” 'ScriptFunction'이 λ°˜ν™˜λ˜λ„λ‘ ν•˜κ³ , μ΄λŠ” just-in-time 컴파일둜 μ΅œμ ν™”λ©λ‹ˆλ‹€.

UNet λͺ¨λΈμ„ μΆ”μ ν•˜κΈ° μœ„ν•΄ λ‹€μŒμ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```python
import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# torch disable grad
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50


# load inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
unet = pipe.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # set return_dict=False as default

# μ›Œλ°μ—…
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")


# μ›Œλ°μ—… 및 κ·Έλž˜ν”„ μ΅œμ ν™”
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)


# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# λͺ¨λΈ μ €μž₯
unet_traced.save("unet_traced.pt")
```

κ·Έ λ‹€μŒ, νŒŒμ΄ν”„λΌμΈμ˜ `unet` νŠΉμ„±μ„ λ‹€μŒκ³Ό 같이 μΆ”μ λœ λͺ¨λΈλ‘œ λ°”κΏ€ 수 μžˆμŠ΅λ‹ˆλ‹€.

```python
from diffusers import StableDiffusionPipeline
import torch
from dataclasses import dataclass


@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# use the jitted unet
unet_traced = torch.jit.load("unet_traced.pt")


# del pipe.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipe.unet = TracedUNet()

prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]
```


## Memory-efficient attention

μ–΄ν…μ…˜ λΈ”λ‘μ˜ λŒ€μ—­ν­μ„ μ΅œμ ν™”ν•˜λŠ” 졜근 μž‘μ—…μœΌλ‘œ GPU λ©”λͺ¨λ¦¬ μ‚¬μš©λŸ‰μ΄ 크게 ν–₯μƒλ˜κ³  ν–₯μƒλ˜μ—ˆμŠ΅λ‹ˆλ‹€.
@tridao의 κ°€μž₯ 졜근의 ν”Œλž˜μ‹œ μ–΄ν…μ…˜: [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf).

Here are the speed-ups obtained on a few Nvidia GPUs when running inference at 512x512 with a batch size of 1 (one prompt):

| GPU              | Base Attention FP16 | Memory Efficient Attention FP16 |
|------------------|---------------------|---------------------------------|
| NVIDIA Tesla T4  | 3.5it/s             | 5.5it/s                         |
| NVIDIA 3060 RTX  | 4.6it/s             | 7.8it/s                         |
| NVIDIA A10G      | 8.88it/s            | 15.6it/s                        |
| NVIDIA RTX A6000 | 11.7it/s            | 21.09it/s                       |
| NVIDIA TITAN RTX | 12.51it/s           | 18.22it/s                       |
| A100-SXM4-40GB   | 18.6it/s            | 29.0it/s                        |
| A100-SXM-80GB    | 18.7it/s            | 29.5it/s                        |

To leverage it, you just need to make sure that:
 - PyTorch > 1.12
 - CUDA available
 - [the xformers library is installed](xformers)

```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

with torch.inference_mode():
    sample = pipe("a small cat")

# optional: you can disable it again via
# pipe.disable_xformers_memory_efficient_attention()
```
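
If `enable_xformers_memory_efficient_attention()` fails, the most common cause is that xformers is not installed or was built against a different PyTorch/CUDA combination. A quick, hedged way to check that the package is at least importable:

```python
import importlib.util

if importlib.util.find_spec("xformers") is None:
    print("xformers is not installed; see the installation guide linked above")
```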