patrickvonplaten
committed on
Commit a02a1d6
1 Parent(s): bc59aa9
Update README.md
README.md CHANGED
@@ -21,7 +21,20 @@ tags:
Amused is a lightweight text to image model based off of the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.

-
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/97ca2Vqm7jBfCAzq20TtF.png)
+
+*The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists of three separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a U-ViT. During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. cos(r · π/2) with r ∼ Uniform(0, 1). The model is trained via cross-entropy loss to predict the masked tokens. After the model is trained on 256x256 images, downsampling and upsampling layers are added, and training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text encoder’s hidden states and iteratively predicts values for all masked tokens. The cosine masking schedule determines a percentage of the most confident token predictions to be fixed after every iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into image pixels.*

## 1. Usage
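The masking behavior described in the caption can be sketched in a few lines of PyTorch. The snippet below is purely illustrative and is not code from this repository: the 256-token grid, the random stand-in confidences, and all variable names are assumptions; only the cos(r · π/2) masking ratio and the 12 inference steps come from the description above.

```python
# Illustrative sketch of the cosine masking schedule described above (not
# repository code). Only cos(r * pi / 2) and the 12 steps come from the
# caption; the token count and the random "confidences" are stand-ins.
import math
import random

import torch

num_tokens = 256  # e.g. a 16x16 grid of VQ-GAN latent tokens (assumed size)

# Training: mask a fraction cos(r * pi / 2) of the tokens, with r ~ Uniform(0, 1).
r = random.random()
num_masked = int(math.cos(r * math.pi / 2) * num_tokens)
mask = torch.zeros(num_tokens, dtype=torch.bool)
mask[torch.randperm(num_tokens)[:num_masked]] = True  # True = token must be predicted

# Inference: start fully masked and fix the most confident predictions each
# step, so that after 12 steps no masked tokens remain.
mask = torch.ones(num_tokens, dtype=torch.bool)
steps = 12
for step in range(1, steps + 1):
    # Number of tokens that should still be masked after this step.
    num_still_masked = int(math.cos(step / steps * math.pi / 2) * num_tokens)
    # The real model ranks tokens by the U-ViT's confidence over the VQ
    # codebook; random scores stand in for that here.
    confidence = torch.rand(num_tokens)
    confidence[~mask] = float("inf")  # tokens fixed in earlier steps stay fixed
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[confidence.argsort()[:num_still_masked]] = True  # least confident stay masked

assert not mask.any()  # after 12 steps every token has been predicted
```

With this schedule the model unmasks slowly at first and quickly at the end: the first step fixes only a handful of tokens, while the final step fixes roughly 13% of them.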
@@ -220,6 +233,7 @@ Additionally, amused uses the smaller CLIP as its text encoder instead of T5 com
Flash attention is enabled by default in the diffusers codebase through torch `F.scaled_dot_product_attention`
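The primitive mentioned above can be called directly from PyTorch. The shapes below are arbitrary illustrations, not the actual attention dimensions used by amused.

```python
# Minimal illustration of torch's built-in scaled dot-product attention; the
# shapes are arbitrary and not tied to amused's transformer.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 64, 32)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 64, 32)
v = torch.randn(1, 8, 64, 32)

# Dispatches to a flash / memory-efficient kernel when one is available for
# the device and dtype, and falls back to the math implementation otherwise.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 64, 32])
```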
### torch.compile
+
To use torch.compile, simply wrap the transformer in torch.compile i.e.

```python