Amused is a lightweight text-to-image model based on the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once.
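As a quick illustration of the "many images at once" use case, here is a minimal sketch using the `AmusedPipeline` integration in `diffusers`; the checkpoint id is a placeholder, not necessarily this repository's id:

```python
import torch
from diffusers import AmusedPipeline

# Placeholder checkpoint id; substitute the id of the checkpoint you
# actually want to load.
pipe = AmusedPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Amused decodes in a small, fixed number of steps, so batching several
# prompts into one call is a cheap way to generate many images at once.
prompts = ["a photo of an astronaut riding a horse"] * 4
images = pipe(prompts).images
images[0].save("astronaut.png")
```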
![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/97ca2Vqm7jBfCAzq20TtF.png)
*The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists of three separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a U-ViT. During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. cos(r · π/2) with r ∼ Uniform(0, 1). The model is trained via cross-entropy loss to predict the masked tokens. After the model is trained on 256x256 images, downsampling and upsampling layers are added, and training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text encoder's hidden states and iteratively predicts values for all masked tokens. The cosine masking schedule determines a percentage of the most confident token predictions to be fixed after every iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into image pixels.*
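The caption's inference procedure can be made concrete with a short sketch. This is an illustrative MaskGIT-style decoding loop, not the actual diffusers implementation; `predict_logits`, `mask_id`, and `num_tokens` are hypothetical stand-ins for the U-ViT forward pass (conditioned on the text encoder's hidden states) and the VQ-GAN codebook's mask token:

```python
import math
import torch

def cosine_schedule(t: float) -> float:
    # Fraction of latent tokens still masked at progress t in [0, 1]:
    # cos(t * pi / 2), i.e. 1.0 at the start and 0.0 at the end.
    return math.cos(t * math.pi / 2)

@torch.no_grad()
def iterative_decode(predict_logits, num_tokens=256, num_steps=12, mask_id=8192):
    # Start with every latent token masked.
    tokens = torch.full((num_tokens,), mask_id, dtype=torch.long)
    for step in range(1, num_steps + 1):
        # `predict_logits` stands in for the conditioned U-ViT forward pass.
        logits = predict_logits(tokens)              # (num_tokens, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Already-fixed tokens keep their value and never get re-masked.
        fixed = tokens != mask_id
        pred = torch.where(fixed, tokens, pred)
        conf = torch.where(fixed, torch.full_like(conf, float("inf")), conf)
        # The schedule says how many tokens remain masked after this step;
        # at step == num_steps it reaches zero, so everything is predicted.
        num_masked = int(cosine_schedule(step / num_steps) * num_tokens)
        # Fix the most confident predictions; re-mask the least confident.
        tokens = pred.clone()
        tokens[conf.argsort()[:num_masked]] = mask_id
    return tokens  # fully unmasked; the VQ-GAN decoder maps these to pixels
```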
## 1. Usage