<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>

## Würstchen - Overview

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32 images. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, it achieves a 42x spatial compression. This was unseen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires only a fraction of the compute used for current top-performing models, which also allows cheaper and faster inference.
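To get a feel for what a 42x spatial compression buys, here is some back-of-the-envelope arithmetic (illustrative only; the exact latent grid size depends on the model configuration):

```python
# Illustrative arithmetic: spatial positions at 1024x1024 pixels vs. a
# 42x spatially compressed latent grid.
pixel_res = 1024
compression = 42  # total spatial compression of Stage A + Stage B

latent_res = pixel_res // compression          # -> 24 (a ~24x24 grid)
ratio = (pixel_res ** 2) / (latent_res ** 2)   # spatial positions saved

print(latent_res, round(ratio))  # 24 1820
```

A model operating on a ~24x24 grid instead of 1024x1024 pixels therefore processes roughly three orders of magnitude fewer spatial positions.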
## Würstchen - Decoder

The Decoder is what we refer to as "Stage A" and "Stage B". It takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN space, and Stage A (which is a VQGAN) then decodes those latents into pixel space. Together, they achieve a spatial compression of 42x.

**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in its reconstructions, which are especially noticeable to us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!

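The two decoding steps can be sketched shape-wise as follows. This is a minimal, framework-free sketch, not the real networks; the 24x24 embedding grid, the 256x256 VQGAN grid, and the 4x Stage A factor are illustrative assumptions for a 1024x1024 image:

```python
# Shape-only sketch of the decoding path (hypothetical stand-ins for
# the real Stage B / Stage A networks; grid sizes are assumptions).

def stage_b_decode(image_embedding_shape):
    # Stage B (Diffusion Autoencoder): image embeddings -> VQGAN latents.
    # Assumed: a ~24x24 embedding grid maps to a 256x256 latent grid.
    return (256, 256)

def stage_a_decode(vqgan_latent_shape):
    # Stage A (VQGAN): latents -> pixel space, assumed 4x upscale.
    h, w = vqgan_latent_shape
    return (h * 4, w * 4)

# Decoder = Stage B followed by Stage A:
print(stage_a_decode(stage_b_decode((24, 24))))  # (1024, 1024)
```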
### Image Sizes

<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/IfVsUDcP15OY-5wyLYKnQ.jpeg" width=1000>

## How to run

This pipeline should be run together with a prior, https://huggingface.co/warp-ai/wuerstchen-prior:

```py
import torch
```