Amused is a lightweight text-to-image model based on the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once.
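As a quick illustration of the "many images at once" use case, here is a minimal sketch using the `AmusedPipeline` integration in `diffusers`; the checkpoint id is a placeholder, not necessarily this repository's id:

```python
import torch
from diffusers import AmusedPipeline

# Placeholder checkpoint id; substitute the id of the checkpoint you
# actually want to load.
pipe = AmusedPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Amused decodes in a small, fixed number of steps, so batching several
# prompts into one call is a cheap way to generate many images at once.
prompts = ["a photo of an astronaut riding a horse"] * 4
images = pipe(prompts).images
images[0].save("astronaut.png")
```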
![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/97ca2Vqm7jBfCAzq20TtF.png)
*The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists of three separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a U-ViT. During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. cos(r · π/2) with r ∼ Uniform(0, 1). The model is trained via cross-entropy loss to predict the masked tokens. After the model is trained on 256x256 images, downsampling and upsampling layers are added, and training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text encoder's hidden states and iteratively predicts values for all masked tokens. The cosine masking schedule determines a percentage of the most confident token predictions to be fixed after every iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into image pixels.*
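The caption's inference procedure can be made concrete with a short sketch. This is an illustrative MaskGIT-style decoding loop, not the actual diffusers implementation; `predict_logits`, `mask_id`, and `num_tokens` are hypothetical stand-ins for the U-ViT forward pass (conditioned on the text encoder's hidden states) and the VQ-GAN codebook's mask token:

```python
import math
import torch

def cosine_schedule(t: float) -> float:
    # Fraction of latent tokens still masked at progress t in [0, 1]:
    # cos(t * pi / 2), i.e. 1.0 at the start and 0.0 at the end.
    return math.cos(t * math.pi / 2)

@torch.no_grad()
def iterative_decode(predict_logits, num_tokens=256, num_steps=12, mask_id=8192):
    # Start with every latent token masked.
    tokens = torch.full((num_tokens,), mask_id, dtype=torch.long)
    for step in range(1, num_steps + 1):
        # `predict_logits` stands in for the conditioned U-ViT forward pass.
        logits = predict_logits(tokens)              # (num_tokens, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Already-fixed tokens keep their value and never get re-masked.
        fixed = tokens != mask_id
        pred = torch.where(fixed, tokens, pred)
        conf = torch.where(fixed, torch.full_like(conf, float("inf")), conf)
        # The schedule says how many tokens remain masked after this step;
        # at step == num_steps it reaches zero, so everything is predicted.
        num_masked = int(cosine_schedule(step / num_steps) * num_tokens)
        # Fix the most confident predictions; re-mask the least confident.
        tokens = pred.clone()
        tokens[conf.argsort()[:num_masked]] = mask_id
    return tokens  # fully unmasked; the VQ-GAN decoder maps these to pixels
```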
## 1. Usage