patrickvonplaten committed
Commit 1d11dd5 · 1 Parent(s): 6e8b570

Update README.md

Files changed (1): README.md +14 -1
README.md CHANGED
@@ -20,7 +20,20 @@ tags:
 
 Amused is a lightweight text to image model based off of the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.
 
-Amused is a vqvae token based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller text encoder clip instead of t5. Due to its small parameter count and few forward pass generation process, amused can generate many images quickly. This benefit is seen particularly at larger batch sizes.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/97ca2Vqm7jBfCAzq20TtF.png)
+
+*The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists of three
+separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a U-ViT.
+During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The
+proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. cos(r · π/2)
+with r ∼ Uniform(0, 1). The model is trained via cross-entropy loss to predict the masked tokens.
+After the model is trained on 256x256 images, downsampling and upsampling layers are added, and
+training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text
+encoder’s hidden states and iteratively predicts values for all masked tokens. The cosine masking
+schedule determines a percentage of the most confident token predictions to be fixed after every
+iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into
+image pixels.*
 
 ## 1. Usage
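The new caption compresses the two scheduling details that make aMUSEd fast: training masks a cos(r · π/2) fraction of latent tokens, and inference fixes the most confident predictions each step until everything is unmasked after 12 forward passes. Below is a minimal NumPy sketch of that schedule logic, with `logits_fn` standing in for the text-conditioned U-ViT; all names here are hypothetical illustrations, not code from the amused repository.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a masked latent token

def cosine_mask_ratio(r: float) -> float:
    """Fraction of tokens left masked at progress r in [0, 1]: cos(r * pi / 2)."""
    return float(np.cos(r * np.pi / 2))

def iterative_unmasking(logits_fn, num_tokens: int, num_steps: int = 12) -> np.ndarray:
    """Confidence-based unmasking loop described in the caption.

    `logits_fn(tokens)` stands in for the U-ViT forward pass and must
    return logits of shape (num_tokens, vocab_size).
    """
    tokens = np.full(num_tokens, MASK, dtype=np.int64)  # start fully masked
    for step in range(num_steps):
        mask = tokens == MASK
        logits = logits_fn(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)   # most likely token per position
        conf = probs.max(-1)      # its probability, used as confidence

        # The schedule fixes how many tokens may remain masked after this
        # step; close the gap by fixing the most confident masked predictions.
        remain = int(cosine_mask_ratio((step + 1) / num_steps) * num_tokens)
        n_fix = max(int(mask.sum()) - remain, 1)
        conf = np.where(mask, conf, -np.inf)  # never overwrite fixed tokens
        fix = np.argsort(-conf)[:n_fix]
        tokens[fix] = pred[fix]
    return tokens

# Toy stand-in for the U-ViT: fixed random logits over a tiny vocabulary.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(16, 8))
print(iterative_unmasking(lambda tokens: toy_logits, num_tokens=16))
```

Because the schedule drives the masked fraction to zero at r = 1, every token is committed by the final iteration, which is what bounds generation to a constant 12 forward passes rather than the many denoising steps typical of diffusion samplers.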