File size: 18,592 Bytes
c19ca42
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
# Benchmark

To run standardized benchmark, you can use UI -> System -> Benchmark feature
or via CLI using `cli/run-benchmark.py` script.

It runs identical tests, but often CLI is faster due to lower overhead.

## Environment

- Hardware: nVidia RTX 4090 with i9-13900KF
- Packages: Torch 2.1.0 with CUDA 12.1 and cuDNN 8.9
- Params: model=SD15 | batch-size=4 | batch-count=4 | steps=50 | resolution=512px | sampler=Euler A

## Results

Basic tests using UI:

|||Diffusers||Original|||
|---|---|---|---|---|---|---|
|Precision|Params|SDP|xFormers|SDP|xFormers|None|
|FP32|Default|33.0|20.0||||
|BF16|Default|73.0|45.5||||
|FP16|Default|73.0|75.0|48.0|48.6|17.3|
||NHWC (channels last)|72.0|||||
||HyperTile (256)|79.0|||||
||ToMe (0.5)|77.0|||||
||Model no-move (medvram)|85.0|||||
||VAE no-slicing, no-tiling|73.8|||||
||Sequential offload (lowvram)|27.0|||||

### Notes: Options

- All numbers are in **it/s** and higher is better  
- Test matrix is not full as some options can be combined together *(e.g. cuDNN + HyperTile)*  
  while others cannot *(e.g. HyperTile + ToMe)*
- Results may differ on different GPU/CPU combinations  
  For example, pairing better CPU with older GPU may benefit from more processing done on CPU and leaving GPU to do only core ML tasks while paring high-end GPU with older CPU may result in lower results since CPU cannot feed enough tasks to GPU
- Diffusers perform significantly better than original backend on modern hardware since tasks remain on GPU for longer time  
  Equally, original backend may perform better on older hardware
- Running quick tasks such as single image generate at low steps may not be sufficient to fully saturate high-end GPU so results will be lower
- xFormers have a slight performance advantage over SDP  
  However, SDP is a built-in in Torch and "just works" while xFormers needs manual install and its highly version dependent  
- Some extensions can add significant overhead to pre/post processing even if they are not used
- Not worth consideration: *cuDNN*, *NHWC*, *inference mode*, *eval*
  - cuDNN full bench finds best math algorithm for specific GPU, but default is nearly identical
  - channels-last should better trigger utilization of tensor cores, but in practise result is nearly identical
  - inference-mode should have more optimizations than default no_grad, but in practise result is nearly identical
  - eval mode should allow for removal of some params in the model, but in pracise result is nearly identical
- Benefit of *BF16* vs *FP16* is not performance as much, its ability to run higher numerical ranges so it can perform calculations where *FP16* may result in *NaN*
- Running in *FP32* results in 60% performance drop - if you need *FP32*, you're leaving a lot on the table
- Cost of using lowvram is very high as it needs to swap parts of model in-memory. Even using medvram comes at noticeable cost
- Best: xFormers, FP16, HyperTile, no-model-move, no-slicing/tiling

## Compile

|Compile type|Performance|Overhead|
|---|---|---|
|cudnn/default|73.5|4|
|inductor/default|89.0|40|
|inductor/reduce-overhead|92.0|40|
|inductor/max-autotune|91.0|220|
|nvfuser/default|84.0|5|
|cudagraphs/reduce-overhead|85.0|14|
|stable-fast/sdp|96.0|76|
|stable-fast/xformers|96.0|101|
|stable-fast/full-graph|94.0|96|

### Notes: Compile

- Performance numbers is in **it/s** and higher is better  
- Overhead is time in seconds needed to optimize a model with specific params and lower is better  
  Model needs compile on initial generate, but it may also need a recompile if params such as resolution of batch size change
- Model compile may not be compatible with any method that modifies underlying model,  
  including loading Lora weights on top of a model  
- [stable-fast](https://github.com/chengzeyi/stable-fast) compile backend requires that package is manually installed on the system


# Intel ARC

## Environment

- Hardware: Intel ARC 770 LE 16GB with R7 5800X3D & MSI B350M Mortar (PCI-E 3.0) & 48 GB 3200 MHz CL18 RAM  
- OS: Arch Linux with this Docker environment: https://github.com/Disty0/docker-sdnext-ipex
- Packages: Torch 2.1.0a0+cxx11.abi with IPEX 2.1.10+xpu and MKL / DPCPP 2024.0.0
- Params: model=SD15 | batch-size=1 | batch-count=1 | steps=40 | resolution=512px | sampler=Euler a | CFG 6

## Results

|||Diffusers|Original|
|---|---|---|---|
|Precision|Params|it/s|it/s|
|BF16|Default|8.54|7.75|
|FP16|Default|6.92|7.23|
|FP32|Default|3.73|3.74|
|BF16|HyperTile (256)|10.03|9.32|
|BF16|ToMe (0.5)|9.24|8.61|
|BF16|No IPEX Optimize|8.23|7.82|
|BF16|Model no-move (medvram)|9.04||
|BF16|VAE no-slicing, no-tiling|8.67||
|BF16|Sequential offload (lowvram)|1.60|0.67|

## API Benchmarks

```
2024-02-07 22:52:56,406 INFO: {'run-benchmark'}
2024-02-07 22:52:56,407 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-07 22:52:56,432 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': '659ad2e7', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-07 22:52:56,434 INFO: {'platform': {'arch': 'x86_64', 'cpu': '', 'system': 'Linux', 'release': '6.7.3-arch1-2', 'python': '3.11.6', 'torch': '2.1.0a0+cxx11.abi', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-07 22:52:56,437 INFO: {'model': 'SD1.5/SoteMixV3 [dcc16969a0]'}
2024-02-07 22:52:56,441 INFO: {'system': {'cpu': {'free': 48901079040.00001, 'used': 1533939712, 'total': 50435018752.00001}, 'gpu': {'system': {'free': 17079205888, 'used': 0, 'total': 17079205888}, 'session': {'current': 0, 'peak': 0}}}}
2024-02-07 22:52:56,441 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16, 24, 32]}
2024-02-07 22:53:10,362 INFO: {'warmup': 13.92}
2024-02-07 22:53:18,182 INFO: {'batch': 1, 'its': 12.81, 'img': 3.9, 'wall': 3.9, 'peak': 2.61, 'oom': False}
2024-02-07 22:53:31,723 INFO: {'batch': 2, 'its': 15.49, 'img': 3.23, 'wall': 6.45, 'peak': 3.07, 'oom': False}
2024-02-07 22:53:55,512 INFO: {'batch': 4, 'its': 17.18, 'img': 2.91, 'wall': 11.64, 'peak': 3.07, 'oom': False}
2024-02-07 22:54:39,504 INFO: {'batch': 8, 'its': 18.4, 'img': 2.72, 'wall': 21.74, 'peak': 3.07, 'oom': False}
2024-02-07 22:55:43,500 INFO: {'batch': 12, 'its': 18.93, 'img': 2.64, 'wall': 31.7, 'peak': 3.07, 'oom': False}
2024-02-07 22:56:58,086 INFO: {'batch': 16, 'its': 21.61, 'img': 2.31, 'wall': 37.01, 'peak': 3.07, 'oom': False}
2024-02-07 22:58:48,560 INFO: {'batch': 24, 'its': 21.92, 'img': 2.28, 'wall': 54.74, 'peak': 3.64, 'oom': False}
2024-02-07 23:01:09,184 INFO: {'batch': 32, 'its': 22.82, 'img': 2.19, 'wall': 70.12, 'peak': 4.06, 'oom': False}
```

# OpenVINO

## Environment

- Hardware: Intel ARC 770 LE 16GB with R7 5800X3D & MSI B350M Mortar (PCI-E 3.0) & 48 GB 3200 MHz CL18 RAM  
- OS: Arch Linux  
- Packages: Torch 2.1.2+cpu and OpenVINO 2023.2.0
- Params: model=SD15 | batch-size=1 | batch-count=1 | steps=20 | resolution=512px | sampler=Euler a | CFG 6

## GPU Results

|||Diffusers|
|---|---|---|
|Precision|Params|it/s|
|Default|Default|9.21|

## CPU Results

|||Diffusers|
|---|---|---|
|Precision|Params|s/it|
|Default|Default|3.00|
|Default|LCM & CFG 0|1.60|
|INT8|Default|3.30|
|INT4_SYM|Default|4.00|
|INT4_ASYM|Default|4.30|
|NF4|Default|5.25|
|FP32|Diffusers & No OpenVINO|4.20|

# DirectML

- Hardware: Intel Core i9-14900K, SAPPHIRE AMD Radeon RX 7900 XTX NITRO+ Vapor-X 24GB, SAMSUNG DDR5 32GBx4
- Operating System: Windows 11 Build 22631
- Packages: PyTorch 2.0.0 (built with CPU), torch-directml 0.2.0.dev230426
- Performed using `cli/run-benchmark.py` script

Peak: 9.36 with batch size 8.

Possible max batch size: 12 (Slow with 12, OOM with 16)

```log
2024-02-08 20:09:31,923 INFO: {'run-benchmark'}
2024-02-08 20:09:31,924 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-08 20:09:32,005 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': '659ad2e7', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-08 20:09:32,007 INFO: {'platform': {'arch': 'AMD64', 'cpu': 'Intel64 Family 6 Model 183 Stepping 1, GenuineIntel', 'system': 'Windows', 'release': 'Windows-10-10.0.22631-SP0', 'python': '3.10.11', 'torch': '2.0.0+cpu', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-08 20:09:32,013 INFO: {'system': {'cpu': {'free': 136382431232.00002, 'used': 708612096, 'total': 137091043328.00002}, 'gpu': {'error': 'unavailable'}}}
2024-02-08 20:09:32,013 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16]}
2024-02-08 20:09:51,463 INFO: {'warmup': 19.45}
2024-02-08 20:10:03,837 INFO: {'batch': 1, 'its': 8.06, 'img': 6.2, 'wall': 6.2, 'peak': 0.0, 'oom': False}
2024-02-08 20:10:27,845 INFO: {'batch': 2, 'its': 9.02, 'img': 5.54, 'wall': 11.09, 'peak': 0.0, 'oom': False}
2024-02-08 20:11:12,886 INFO: {'batch': 4, 'its': 9.04, 'img': 5.53, 'wall': 22.12, 'peak': 0.0, 'oom': False}
2024-02-08 20:12:38,582 INFO: {'batch': 8, 'its': 9.36, 'img': 5.34, 'wall': 42.76, 'peak': 0.0, 'oom': False}
2024-02-08 20:15:22,610 INFO: {'batch': 12, 'its': 7.31, 'img': 6.84, 'wall': 82.04, 'peak': 0.0, 'oom': False}
2024-02-08 20:15:23,465 ERROR: {'requested': 16, 'received': 0}
2024-02-08 20:15:24,161 ERROR: {'requested': 16, 'received': 0}
2024-02-08 20:15:24,164 INFO: {'batch': 16, 'its': 1150.12, 'img': 0.04, 'wall': 0.7, 'peak': 0.0, 'oom': False}
```

# ONNX Runtime

- Hardware: Intel Core i9-14900K, SAPPHIRE AMD Radeon RX 7900 XTX NITRO+ Vapor-X 24GB, SAMSUNG DDR5 32GBx4
- Operating System: Windows 11 Build 22631
- Packages: PyTorch 2.2.0 (built with CPU), onnxruntime 1.17.0, onnxruntime-directml 1.17.0
- Performed using `cli/run-benchmark.py` script

Peak: 17.58

Possible max batch size: 8 (Not OOM, but very slow with 12 or higher)

```log
2024-02-08 19:20:45,235 INFO: {'run-benchmark'}
2024-02-08 19:20:45,236 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-08 19:20:45,317 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': '659ad2e7', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-08 19:20:45,318 INFO: {'platform': {'arch': 'AMD64', 'cpu': 'Intel64 Family 6 Model 183 Stepping 1, GenuineIntel', 'system': 'Windows', 'release': 'Windows-10-10.0.22631-SP0', 'python': '3.10.12', 'torch': '2.2.0+cpu', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-08 19:20:45,324 INFO: {'system': {'cpu': {'free': 136392728576.00002, 'used': 698314752, 'total': 137091043328.00002}, 'gpu': {'error': 'unavailable'}}}
2024-02-08 19:20:45,324 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16]}
2024-02-08 19:21:03,553 INFO: {'warmup': 18.23}
2024-02-08 19:21:12,036 INFO: {'batch': 1, 'its': 11.81, 'img': 4.23, 'wall': 4.23, 'peak': 0.0, 'oom': False}
2024-02-08 19:21:26,618 INFO: {'batch': 2, 'its': 13.79, 'img': 3.62, 'wall': 7.25, 'peak': 0.0, 'oom': False}
2024-02-08 19:21:54,400 INFO: {'batch': 4, 'its': 14.46, 'img': 3.46, 'wall': 13.83, 'peak': 0.0, 'oom': False}
2024-02-08 19:22:40,407 INFO: {'batch': 8, 'its': 17.58, 'img': 2.84, 'wall': 22.75, 'peak': 0.0, 'oom': False}
2024-02-08 19:30:30,903 INFO: {'batch': 12, 'its': 2.56, 'img': 19.57, 'wall': 234.8, 'peak': 0.0, 'oom': False}
2024-02-08 19:40:08,391 INFO: {'batch': 16, 'its': 2.77, 'img': 18.05, 'wall': 288.86, 'peak': 0.0, 'oom': False}
```

## With optimized model using Olive

- Package: olive-ai 0.4.0

Peak: **54.08**

Possible max batch size: Unknown (at least 48)

```log
2024-02-08 18:51:28,096 INFO: {'run-benchmark'}
2024-02-08 18:51:28,097 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-08 18:51:28,167 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': '659ad2e7', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-08 18:51:28,168 INFO: {'platform': {'arch': 'AMD64', 'cpu': 'Intel64 Family 6 Model 183 Stepping 1, GenuineIntel', 'system': 'Windows', 'release': 'Windows-10-10.0.22631-SP0', 'python': '3.10.12', 'torch': '2.2.0+cpu', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-08 18:51:28,174 INFO: {'system': {'cpu': {'free': 136385822719.99998, 'used': 705220608, 'total': 137091043327.99998}, 'gpu': {'error': 'unavailable'}}}
2024-02-08 18:51:28,174 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16]}
2024-02-08 18:51:42,445 INFO: {'warmup': 14.27}
2024-02-08 18:51:46,603 INFO: {'batch': 1, 'its': 23.63, 'img': 2.12, 'wall': 2.12, 'peak': 0.0, 'oom': False}
2024-02-08 18:52:00,527 INFO: {'batch': 2, 'its': 35.06, 'img': 1.43, 'wall': 2.85, 'peak': 0.0, 'oom': False}
2024-02-08 18:52:18,711 INFO: {'batch': 4, 'its': 40.34, 'img': 1.24, 'wall': 4.96, 'peak': 0.0, 'oom': False}
2024-02-08 18:52:42,958 INFO: {'batch': 8, 'its': 50.51, 'img': 0.99, 'wall': 7.92, 'peak': 0.0, 'oom': False}
2024-02-08 18:53:13,677 INFO: {'batch': 12, 'its': 53.81, 'img': 0.93, 'wall': 11.15, 'peak': 0.0, 'oom': False}
2024-02-08 18:53:51,700 INFO: {'batch': 16, 'its': 54.08, 'img': 0.92, 'wall': 14.79, 'peak': 0.0, 'oom': False}
```

## API Benchmarks

Using latest version of **SD.Next** with **Torch 2.2.0**, **CUDA 12.1**  
*Note*: Usage of SD.Next via API is faster than via UI due to lower overhead.

Environment: **Intel i9-13900KF** platform with **nVidia RTX 4090 GPU**  

As you can see, we're reaching peak performance of **~110 it/s** using simple settings:  

```log
vlado@wsl:~/dev/sdnext-dev $ python cli/run-benchmark.py --maxbatch 32
2024-02-07 11:19:53,026 INFO: {'run-benchmark'}
2024-02-07 11:19:53,027 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-07 11:19:53,046 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': 'd967bd03', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-07 11:19:53,048 INFO: {'platform': {'arch': 'x86_64', 'cpu': 'x86_64', 'system': 'Linux', 'release': '5.15.146.1-microsoft-standard-WSL2', 'python': '3.11.1', 'torch': '2.2.0+cu121', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-07 11:19:53,051 INFO: {'model': 'sd15/lyriel-v16 [ec6f68ea63]'}
2024-02-07 11:19:53,054 INFO: {'system': {'cpu': {'free': 49020043264.0, 'used': 1495736320, 'total': 50515779584.0}, 'gpu': {'system': {'free': 24110956544, 'used': 1645740032, 'total': 25756696576}, 'session': {'current': 0, 'peak': 0}}}}
2024-02-07 11:19:53,054 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16, 24, 32]}
2024-02-07 11:19:59,394 INFO: {'warmup': 6.34}
2024-02-07 11:20:02,354 INFO: {'batch': 1, 'its': 33.63, 'img': 1.49, 'wall': 1.49, 'peak': 7.05, 'oom': False}
2024-02-07 11:20:06,213 INFO: {'batch': 2, 'its': 64.3, 'img': 0.78, 'wall': 1.56, 'peak': 7.1, 'oom': False}
2024-02-07 11:20:11,293 INFO: {'batch': 4, 'its': 90.87, 'img': 0.55, 'wall': 2.2, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:19,416 INFO: {'batch': 8, 'its': 104.6, 'img': 0.48, 'wall': 3.82, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:30,850 INFO: {'batch': 12, 'its': 111.96, 'img': 0.45, 'wall': 5.36, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:46,236 INFO: {'batch': 16, 'its': 110.37, 'img': 0.45, 'wall': 7.25, 'peak': 7.18, 'oom': False}
2024-02-07 11:21:09,338 INFO: {'batch': 24, 'its': 109.75, 'img': 0.46, 'wall': 10.93, 'peak': 7.18, 'oom': False}
2024-02-07 11:21:39,623 INFO: {'batch': 32, 'its': 111.38, 'img': 0.45, 'wall': 14.37, 'peak': 7.18, 'oom': False}
```

With a full optimizations and custom compiled **Stable-Fast**:  
We're reaching peak performance of **~150 it/s** (and **~165 it/s** using *TAESD* instead of full *VAE*):  

```log
vlado@wsl:~/dev/sdnext-dev $ python cli/run-benchmark.py --maxbatch 32
2024-02-07 11:29:23,431 INFO: {'run-benchmark'}
2024-02-07 11:29:23,432 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-07 11:29:23,451 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': 'd967bd03', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-07 11:29:23,453 INFO: {'platform': {'arch': 'x86_64', 'cpu': 'x86_64', 'system': 'Linux', 'release': '5.15.146.1-microsoft-standard-WSL2', 'python': '3.11.1', 'torch': '2.2.0+cu121', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-07 11:29:23,456 INFO: {'model': 'sd15/lyriel-v16 [ec6f68ea63]'}
2024-02-07 11:29:23,459 INFO: {'system': {'cpu': {'free': 49373564927.99999, 'used': 1142214656, 'total': 50515779583.99999}, 'gpu': {'system': {'free': 24110956544, 'used': 1645740032, 'total': 25756696576}, 'session': {'current': 0, 'peak': 0}}}}
2024-02-07 11:29:23,459 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16, 24, 32]}
2024-02-07 11:29:38,504 INFO: {'warmup': 15.04}
2024-02-07 11:29:38,965 INFO: {'batch': 1, 'its': 78.16, 'img': 0.67, 'wall': 0.23, 'peak': 7.11, 'oom': False}
2024-02-07 11:29:42,630 INFO: {'batch': 2, 'its': 98.91, 'img': 0.51, 'wall': 1.01, 'peak': 7.11, 'oom': False}
2024-02-07 11:29:47,192 INFO: {'batch': 4, 'its': 117.92, 'img': 0.42, 'wall': 1.7, 'peak': 7.11, 'oom': False}
2024-02-07 11:29:54,028 INFO: {'batch': 8, 'its': 142.42, 'img': 0.35, 'wall': 2.81, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:03,161 INFO: {'batch': 12, 'its': 153.29, 'img': 0.33, 'wall': 3.91, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:14,921 INFO: {'batch': 16, 'its': 153.41, 'img': 0.33, 'wall': 5.21, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:33,534 INFO: {'batch': 24, 'its': 144.65, 'img': 0.35, 'wall': 8.3, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:56,914 INFO: {'batch': 32, 'its': 150.59, 'img': 0.33, 'wall': 10.63, 'peak': 7.11, 'oom': False}
```

Additional performance may be reached by experimenting with different settings, but combination of such may lead to unstable results  
For example: *channels-last, hyper-tile, tomesd, fused-projections*