patrickvonplaten
committed on
Commit a02a1d6
1 Parent(s): bc59aa9
Update README.md
README.md CHANGED
@@ -21,7 +21,20 @@ tags:
Amused is a lightweight text to image model based off of the [muse](https://arxiv.org/pdf/2301.00704.pdf) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.

-
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/97ca2Vqm7jBfCAzq20TtF.png)
+
+*The diagram shows the training and inference pipelines for aMUSEd. aMUSEd consists of three separately trained components: a pre-trained CLIP-L/14 text encoder, a VQ-GAN, and a U-ViT. During training, the VQ-GAN encoder maps images to a 16x smaller latent resolution. The proportion of masked latent tokens is sampled from a cosine masking schedule, e.g. cos(r · π/2) with r ∼ Uniform(0, 1). The model is trained via cross-entropy loss to predict the masked tokens. After the model is trained on 256x256 images, downsampling and upsampling layers are added, and training is continued on 512x512 images. During inference, the U-ViT is conditioned on the text encoder’s hidden states and iteratively predicts values for all masked tokens. The cosine masking schedule determines a percentage of the most confident token predictions to be fixed after every iteration. After 12 iterations, all tokens have been predicted and are decoded by the VQ-GAN into image pixels.*

## 1. Usage
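The masking behavior described in the caption can be sketched in a few lines of PyTorch. The snippet below is purely illustrative and is not code from this repository: the 256-token grid, the random stand-in confidences, and all variable names are assumptions; only the cos(r · π/2) masking ratio and the 12 inference steps come from the description above.

```python
# Illustrative sketch of the cosine masking schedule described above (not
# repository code). Only cos(r * pi / 2) and the 12 steps come from the
# caption; the token count and the random "confidences" are stand-ins.
import math
import random

import torch

num_tokens = 256  # e.g. a 16x16 grid of VQ-GAN latent tokens (assumed size)

# Training: mask a fraction cos(r * pi / 2) of the tokens, with r ~ Uniform(0, 1).
r = random.random()
num_masked = int(math.cos(r * math.pi / 2) * num_tokens)
mask = torch.zeros(num_tokens, dtype=torch.bool)
mask[torch.randperm(num_tokens)[:num_masked]] = True  # True = token must be predicted

# Inference: start fully masked and fix the most confident predictions each
# step, so that after 12 steps no masked tokens remain.
mask = torch.ones(num_tokens, dtype=torch.bool)
steps = 12
for step in range(1, steps + 1):
    # Number of tokens that should still be masked after this step.
    num_still_masked = int(math.cos(step / steps * math.pi / 2) * num_tokens)
    # The real model ranks tokens by the U-ViT's confidence over the VQ
    # codebook; random scores stand in for that here.
    confidence = torch.rand(num_tokens)
    confidence[~mask] = float("inf")  # tokens fixed in earlier steps stay fixed
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[confidence.argsort()[:num_still_masked]] = True  # least confident stay masked

assert not mask.any()  # after 12 steps every token has been predicted
```

With this schedule the model unmasks slowly at first and quickly at the end: the first step fixes only a handful of tokens, while the final step fixes roughly 13% of them.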
@@ -220,6 +233,7 @@ Additionally, amused uses the smaller CLIP as its text encoder instead of T5 com
Flash attention is enabled by default in the diffusers codebase through torch `F.scaled_dot_product_attention`
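The primitive mentioned above can be called directly from PyTorch. The shapes below are arbitrary illustrations, not the actual attention dimensions used by amused.

```python
# Minimal illustration of torch's built-in scaled dot-product attention; the
# shapes are arbitrary and not tied to amused's transformer.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 64, 32)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 64, 32)
v = torch.randn(1, 8, 64, 32)

# Dispatches to a flash / memory-efficient kernel when one is available for
# the device and dtype, and falls back to the math implementation otherwise.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 64, 32])
```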
### torch.compile
+
To use torch.compile, simply wrap the transformer in torch.compile i.e.

```python