<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Token Merging

Token Merging (introduced in [Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461)) works by progressively merging redundant tokens or patches in the forward pass of a Transformer-based network. This can reduce the inference latency of the underlying network.

Token Merging(ToMe)이 μΆœμ‹œλœ ν›„, μ €μžλ“€μ€ [Fast Stable Diffusion을 μœ„ν•œ 토큰 병합](https://arxiv.org/abs/2303.17604)을 λ°œν‘œν•˜μ—¬ Stable Diffusionκ³Ό 더 잘 ν˜Έν™˜λ˜λŠ” ToMe 버전을 μ†Œκ°œν–ˆμŠ΅λ‹ˆλ‹€. ToMeλ₯Ό μ‚¬μš©ν•˜λ©΄ [`DiffusionPipeline`]의 μΆ”λ‘  지연 μ‹œκ°„μ„ λΆ€λ“œλŸ½κ²Œ 단좕할 수 μžˆμŠ΅λ‹ˆλ‹€. 이 λ¬Έμ„œμ—μ„œλŠ” ToMeλ₯Ό [`StableDiffusionPipeline`]에 μ μš©ν•˜λŠ” 방법, μ˜ˆμƒλ˜λŠ” 속도 ν–₯상, [`StableDiffusionPipeline`]μ—μ„œ ToMeλ₯Ό μ‚¬μš©ν•  λ•Œμ˜ 질적 츑면에 λŒ€ν•΄ μ„€λͺ…ν•©λ‹ˆλ‹€.

## Using ToMe

ToMe의 μ €μžλ“€μ€ [`tomesd`](https://github.com/dbolya/tomesd)λΌλŠ” νŽΈλ¦¬ν•œ Python 라이브러리λ₯Ό κ³΅κ°œν–ˆλŠ”λ°, 이 라이브러리λ₯Ό μ΄μš©ν•˜λ©΄ [`DiffusionPipeline`]에 ToMeλ₯Ό λ‹€μŒκ³Ό 같이 μ μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```diff
from diffusers import StableDiffusionPipeline
import torch
import tomesd

pipeline = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
+ tomesd.apply_patch(pipeline, ratio=0.5)

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```

And that's it!

`tomesd.apply_patch()` exposes [a number of arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important of these arguments is `ratio`, which controls the number of tokens that are merged during the forward pass. For more details on `tomesd`, refer to the [repository](https://github.com/dbolya/tomesd) and the [paper](https://arxiv.org/abs/2303.17604).
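
As an illustration, here is a minimal sketch of tuning these arguments, assuming the argument names documented in the `tomesd` README (such as `max_downsample`) and the `tomesd.remove_patch()` helper:

```py
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A higher ratio merges more tokens in every forward pass:
# faster inference, but potentially lower image quality.
tomesd.apply_patch(pipeline, ratio=0.75, max_downsample=1)
image_fast = pipeline("a photo of an astronaut riding a horse on mars").images[0]

# Undo the patch to restore the unmodified pipeline.
tomesd.remove_patch(pipeline)
```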

## Benchmarking `tomesd` with `StableDiffusionPipeline`

We benchmarked the impact of using `tomesd` on the [`StableDiffusionPipeline`] along with [xformers](https://huggingface.co/docs/diffusers/optimization/xformers) across different image resolutions. We used A100 and V100 GPUs as our test devices with the following development environment (with Python 3.8.5):

```bash
- `diffusers` version: 0.15.1
- Python version: 3.8.16
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Huggingface_hub version: 0.13.2
- Transformers version: 4.27.2
- Accelerate version: 0.18.0
- xFormers version: 0.0.16
- tomesd version: 0.1.2
```

λ²€μΉ˜λ§ˆν‚Ήμ—λŠ” λ‹€μŒ 슀크립트λ₯Ό μ‚¬μš©ν–ˆμŠ΅λ‹ˆλ‹€: [https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:

### A100

| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe Speedup (%) | ToMe + xFormers Speedup (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 6.88 | 5.26 | 4.69 | 23.54651163 | 31.83139535 |
|  |  |  |  |  |  |  |
| 768 | 10 | OOM | 14.71 | 11 |  |  |
|  | 8 | OOM | 11.56 | 8.84 |  |  |
|  | 4 | OOM | 5.98 | 4.66 |  |  |
|  | 2 | 4.99 | 3.24 | 3.1 | 35.07014028 | 37.8757515 |
|  | 1 | 3.29 | 2.24 | 2.03 | 31.91489362 | 38.29787234 |
|  |  |  |  |  |  |  |
| 1024 | 10 | OOM | OOM | OOM |  |  |
|  | 8 | OOM | OOM | OOM |  |  |
|  | 4 | OOM | 12.51 | 9.09 |  |  |
|  | 2 | OOM | 6.52 | 4.96 |  |  |
|  | 1 | 6.4 | 3.61 | 2.81 | 43.59375 | 56.09375 |

***The timings reported are in seconds. Speedups are calculated with respect to `Vanilla`.***

### V100

| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe Speedup (%) | ToMe + xFormers Speedup (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | OOM | 10.03 | 9.29 |  |  |
|  | 8 | OOM | 8.05 | 7.47 |  |  |
|  | 4 | 5.7 | 4.3 | 3.98 | 24.56140351 | 30.1754386 |
|  | 2 | 3.14 | 2.43 | 2.27 | 22.61146497 | 27.70700637 |
|  | 1 | 1.88 | 1.57 | 1.57 | 16.4893617 | 16.4893617 |
|  |  |  |  |  |  |  |
| 768 | 10 | OOM | OOM | 23.67 |  |  |
|  | 8 | OOM | OOM | 18.81 |  |  |
|  | 4 | OOM | 11.81 | 9.7 |  |  |
|  | 2 | OOM | 6.27 | 5.2 |  |  |
|  | 1 | 5.43 | 3.38 | 2.82 | 37.75322284 | 48.06629834 |
|  |  |  |  |  |  |  |
| 1024 | 10 | OOM | OOM | OOM |  |  |
|  | 8 | OOM | OOM | OOM |  |  |
|  | 4 | OOM | OOM | 19.35 |  |  |
|  | 2 | OOM | 13 | 10.78 |  |  |
|  | 1 | OOM | 6.66 | 5.54 |  |  |

μœ„μ˜ ν‘œμ—μ„œ λ³Ό 수 μžˆλ“―μ΄, 이미지 해상도가 λ†’μ„μˆ˜λ‘ `tomesd`λ₯Ό μ‚¬μš©ν•œ 속도 ν–₯상이 λ”μš± λ‘λ“œλŸ¬μ§‘λ‹ˆλ‹€. λ˜ν•œ `tomesd`λ₯Ό μ‚¬μš©ν•˜λ©΄ 1024x1024와 같은 더 높은 ν•΄μƒλ„μ—μ„œ νŒŒμ΄ν”„λΌμΈμ„ μ‹€ν–‰ν•  수 μžˆλ‹€λŠ” 점도 ν₯λ―Έλ‘­μŠ΅λ‹ˆλ‹€. 

You may be able to speed up inference even further with [`torch.compile()`](https://huggingface.co/docs/diffusers/optimization/torch2.0).
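
A minimal sketch of combining the two, assuming PyTorch 2.0+ and the UNet-compilation pattern from the linked guide (whether `fullgraph=True` interacts cleanly with the ToMe patch may depend on your versions):

```py
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Apply ToMe first, then compile the UNet (requires PyTorch 2.0+).
tomesd.apply_patch(pipeline, ratio=0.5)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

# The first call is slow because of compilation; subsequent calls are faster.
image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```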

## Quality

As reported in [the paper](https://arxiv.org/abs/2303.17604), ToMe can preserve the quality of the generated images to a great extent while speeding up inference. Increasing the `ratio` makes it possible to speed up inference even further, but that may come at the cost of degraded image quality.

To test the quality of the generated samples using our setup, we sampled a few prompts from the "Parti Prompts" (introduced in [Parti](https://parti.research.google/)) and performed inference with the [`StableDiffusionPipeline`] in the following settings (a configuration sketch follows the list):

- Vanilla [`StableDiffusionPipeline`]
- [`StableDiffusionPipeline`] + ToMe
- [`StableDiffusionPipeline`] + ToMe + xformers
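
A minimal sketch of these three configurations, assuming xformers is installed and using `enable_xformers_memory_efficient_attention()` as the Diffusers switch for xformers attention:

```py
import torch
import tomesd
from diffusers import StableDiffusionPipeline

def load_pipeline():
    return StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

# Vanilla
vanilla = load_pipeline()

# + ToMe
tome = load_pipeline()
tomesd.apply_patch(tome, ratio=0.5)

# + ToMe + xformers
tome_xformers = load_pipeline()
tomesd.apply_patch(tome_xformers, ratio=0.5)
tome_xformers.enable_xformers_memory_efficient_attention()
```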

μƒμ„±λœ μƒ˜ν”Œμ˜ ν’ˆμ§ˆμ΄ 크게 μ €ν•˜λ˜λŠ” 것을 λ°œκ²¬ν•˜μ§€ λͺ»ν–ˆμŠ΅λ‹ˆλ‹€. λ‹€μŒμ€ μƒ˜ν”Œμž…λ‹ˆλ‹€: 

![tome-samples](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/tome/tome_samples.png)

μƒμ„±λœ μƒ˜ν”Œμ€ [μ—¬κΈ°](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=)μ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€. 이 μ‹€ν—˜μ„ μˆ˜ν–‰ν•˜κΈ° μœ„ν•΄ [이 슀크립트](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd)λ₯Ό μ‚¬μš©ν–ˆμŠ΅λ‹ˆλ‹€.