|
# Stable Cascade |
|
<p align="center"> |
|
<img src="figures/collage_1.jpg" width="800"> |
|
</p> |
|
|
|
This is the official codebase for **Stable Cascade**. We provide training & inference scripts, as well as a variety of models you can use.
|
<br><br> |
|
This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture, and its main difference from other models, like Stable Diffusion, is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a 1024x1024 image to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable Diffusion 1.5. <br> <br>
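To put rough numbers on this claim, a quick back-of-the-envelope comparison (plain arithmetic, no model code):

```python
# Latent grid sizes for a 1024x1024 image under each compression factor.
sd_side = 1024 // 8        # Stable Diffusion, factor 8 -> 128
cascade_side = 1024 // 42  # Stable Cascade, factor ~42 -> 24

print(sd_side ** 2)        # 16384 spatial positions
print(cascade_side ** 2)   # 576 spatial positions, roughly 28x fewer
```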
|
Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions, such as finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc., are possible with this method as well. A few of these (finetuning, ControlNet, LoRA) are already provided in the [training](train) and [inference](inference) sections.
|
|
|
Moreover, Stable Cascade achieves impressive results, both visually and in evaluations. According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all comparisons. The figure below shows the results from a human evaluation using a mix of parti-prompts (link) and aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
|
<br> |
|
<p align="center"> |
|
<img height="300" src="figures/comparison.png"/> |
|
</p> |
|
|
|
Stable Cascade's focus on efficiency is evidenced through its architecture and its highly compressed latent space. Despite the largest model containing 1.4 billion parameters more than Stable Diffusion XL, it still features faster inference times, as can be seen in the figure below.
|
|
|
<p align="center"> |
|
<img height="300" src="figures/comparison-inference-speed.jpg"/> |
|
</p> |
|
|
|
<hr> |
|
<p align="center"> |
|
<img src="figures/collage_2.jpg" width="800"> |
|
</p> |
|
|
|
## Model Overview |
|
Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade for generating images, hence the name "Stable Cascade". Stage A & B are used to compress images, similar to the role of the VAE in Stable Diffusion. However, as mentioned before, this setup achieves a much higher compression of images. Stage C, in turn, is responsible for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually. Note that Stage A is a VAE and both Stage B & C are diffusion models.
|
|
|
<p align="center"> |
|
<img src="figures/model-overview.jpg" width="600"> |
|
</p> |
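To make the data flow concrete, here is a shape-level sketch of the cascade using dummy tensors only; the intermediate channel counts and resolution are illustrative, not exact specifications of the released checkpoints:

```python
import torch

# Stage C (diffusion model): generates the tiny text-conditional latents.
stage_c_latents = torch.randn(1, 16, 24, 24)

# Stage B (diffusion model): expands those latents back toward pixel space,
# conditioned on Stage C's output.
stage_b_latents = torch.randn(1, 4, 256, 256)

# Stage A (VAE): decodes Stage B's output into the final image.
image = torch.randn(1, 3, 1024, 1024)
```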
|
|
|
For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes in a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was put into its finetuning. The two versions of Stage B amount to 700 million and 1.5 billion parameters. Both achieve great results; however, the 1.5 billion parameter version excels at reconstructing small and fine details. Therefore, you will achieve the best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to its small size.
|
|
|
## Getting Started |
|
This section will briefly outline how you can get started with **Stable Cascade**. |
|
|
|
### Inference |
|
Running the model can be done through the notebooks provided in the [inference](inference) section. You will find more details regarding downloading the models, compute requirements, as well as some tutorials on how to use the models. Specifically, there are four notebooks provided for the following use cases:
|
#### Text-to-Image |
|
A compact [notebook](inference/text_to_image.ipynb) that provides you with basic functionality for text-to-image, image-variation and image-to-image.
|
- Text-to-Image |
|
|
|
`Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee.` |
|
<p align="center"> |
|
<img src="figures/text-to-image-example-penguin.jpg" width="800"> |
|
</p> |
|
|
|
- Image Variation |
|
|
|
The model can also understand image embeddings, which makes it possible to generate variations of a given image (left). No prompt was given here.
|
<p align="center"> |
|
<img src="figures/image-variations-example-headset.jpg" width="800"> |
|
</p> |
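Under the hood, this relies on image embeddings from a CLIP vision encoder. A minimal sketch of computing such an embedding is below; the checkpoint name is a placeholder, and the notebook shows which encoder Stable Cascade actually conditions on:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Compute a CLIP image embedding to condition Stage C on instead of text.
# The checkpoint below is an assumption for illustration purposes only.
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("headset.jpg")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    image_embeds = encoder(pixel_values=pixel_values).image_embeds  # (1, 768)
```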
|
|
|
- Image-to-Image |
|
|
|
This works as usual: an image is noised up to a specific point, and the model then generates from that starting point. Here, the left image is noised to 80% and the caption is: `A person riding a rodent.`
|
<p align="center"> |
|
<img src="figures/image-to-image-example-rodent.jpg" width="800"> |
|
</p> |
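Conceptually, the "noised to 80%" step looks like the following self-contained sketch; the tensor is a stand-in for encoded latents and the blend is a simplified schedule, with the repo's sampler handling the real noising:

```python
import torch

def noise_to_strength(latents: torch.Tensor, strength: float) -> torch.Tensor:
    """Blend clean latents with Gaussian noise; strength=1.0 is pure noise."""
    noise = torch.randn_like(latents)
    return (1.0 - strength) * latents + strength * noise

# Stand-in for the encoded input image; real latents come from the encoder.
latents = torch.randn(1, 16, 24, 24)
start = noise_to_strength(latents, strength=0.8)  # "noised to 80%"
# Sampling then resumes from the timestep corresponding to strength=0.8.
```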
|
|
|
Furthermore, the model is also accessible in the diffusers 🤗 library. You can find the documentation and usage [here](https://huggingface.co/stabilityai/stable-cascade). |
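A minimal sketch of the two-pipeline diffusers flow is shown below; pipeline names and settings follow the diffusers integration, but double-check them against the model card linked above, as the integration may still be in flux:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# Stage C (prior) generates image embeddings from text; Stage B/A (decoder)
# turn those embeddings into the final image.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")

prompt = "Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee."
prior_out = prior(prompt=prompt, num_inference_steps=20, guidance_scale=4.0)
images = decoder(
    image_embeddings=prior_out.image_embeddings.to(torch.float16),
    prompt=prompt,
    num_inference_steps=10,
    guidance_scale=0.0,
).images
images[0].save("penguin.png")
```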
|
#### ControlNet |
|
This [notebook](inference/controlnet.ipynb) shows how to use ControlNets that were trained by us or how to use one that you trained yourself for Stable Cascade. With this release, we provide the following ControlNets:
|
- Inpainting / Outpainting |
|
|
|
<p align="center"> |
|
<img src="figures/controlnet-paint.jpg" width="800"> |
|
</p> |
|
|
|
- Face Identity |
|
|
|
<p align="center"> |
|
<img src="figures/controlnet-face.jpg" width="800"> |
|
</p> |
|
|
|
**Note**: The Face Identity ControlNet will be released at a later point. |
|
|
|
- Canny |
|
|
|
<p align="center"> |
|
<img src="figures/controlnet-canny.jpg" width="800"> |
|
</p> |
|
|
|
- Super Resolution |
|
<p align="center"> |
|
<img src="figures/controlnet-sr.jpg" width="800"> |
|
</p> |
|
|
|
These can all be used through the same notebook and only require changing the config for each ControlNet. More information is provided in the [inference guide](inference).
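For illustration, switching ControlNets boils down to loading a different YAML config; the path below is a hypothetical example, and the actual file names live under the configs folder of this repo:

```python
import yaml

# Hypothetical config path for illustration; see configs/ in this repo for
# the actual file names used by the notebook.
with open("configs/inference/controlnet_c_3b_canny.yaml") as f:
    config = yaml.safe_load(f)
print(config.keys())  # e.g. checkpoint paths, ControlNet type, sampling settings
```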
|
#### LoRA |
|
We also provide our own implementation for training and using LoRAs with Stable Cascade, which can be used to finetune the text-conditional model (Stage C). Specifically, you can add and learn new tokens and add LoRA layers to the model. This [notebook](inference/lora.ipynb) shows how you can use a trained LoRA.
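For intuition, this is the basic mechanism of a LoRA layer, as a generic sketch rather than the repo's exact implementation: the frozen weight is adapted by a trainable low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer whose frozen weight is adapted by a low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the original weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # start as no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```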
|
For example, after training a LoRA on my dog with the following kind of training images:
|
<p align="center"> |
|
<img src="figures/fernando_original.jpg" width="800"> |
|
</p> |
|
|
|
I can generate the following images of my dog given the prompt:
|
`Cinematic photo of a dog [fernando] wearing a space suit.` |
|
<p align="center"> |
|
<img src="figures/fernando.jpg" width="800"> |
|
</p> |
|
|
|
#### Image Reconstruction |
|
Lastly, one thing that might be very interesting, especially if you want to train your own text-conditional model from scratch, maybe even with a completely different architecture than our Stage C, is to use the (Diffusion) Autoencoder that Stable Cascade uses in order to work in the highly compressed space. Just like people use Stable Diffusion's VAE to train their own models (e.g. DALL·E 3), you could use Stage A & B in the same way, while benefiting from a much higher compression, allowing you to train and run models faster. <br>
The notebook shows how to encode and decode images and what specific benefits you get.
|
For example, say you have the following batch of images of dimension `4 x 3 x 1024 x 1024`: |
|
<p align="center"> |
|
<img src="figures/original.jpg" width="800"> |
|
</p> |
|
|
|
You can encode these images to a compressed size of `4 x 16 x 24 x 24`, giving you a spatial compression factor of `1024 / 24 = 42.67`. Afterwards you can use Stage A & B to decode the images back to `4 x 3 x 1024 x 1024`, giving you the following output:
|
<p align="center"> |
|
<img src="figures/reconstructed.jpg" width="800"> |
|
</p> |
|
|
|
As you can see, the reconstructions are surprisingly close, even for small details. Such reconstructions are not possible with a standard VAE at this level of compression. The [notebook](inference/reconstruct_images.ipynb) gives you more information and easy code to try it out.
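If you just want to sanity-check the shapes before opening the notebook, here is a stand-in sketch; the tensors are dummies, and the real encode/decode calls for Stage A & B live in the notebook:

```python
import torch

images = torch.randn(4, 3, 1024, 1024)   # input batch, as in the example above
latents = torch.randn(4, 16, 24, 24)     # what encoding produces: 16ch at 24x24

spatial_compression = 1024 / 24                 # ~42.67, as computed above
size_ratio = images.numel() / latents.numel()   # ~341x fewer values overall
```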
|
|
|
### Training |
|
We provide code for training Stable Cascade from scratch, finetuning, ControlNet and LoRA. You can find a comprehensive explanation for how to do so in the [training folder](train).
|
|
|
## Remarks |
|
The codebase is in early development. You might encounter unexpected errors, or training and inference code that is not yet perfectly optimized. We apologize for that in advance. If there is interest, we will continue releasing updates, aiming to bring in the latest improvements and optimizations. Moreover, we would be more than happy to receive ideas, feedback or even updates from people that would like to contribute. Cheers.
|
|
|
## Gradio App |
|
First, install gradio and diffusers by running:
```bash
pip3 install gradio
pip3 install accelerate  # optional
pip3 install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
```
|
Then from the root of the project run this command: |
|
```bash
PYTHONPATH=./ python3 gradio_app/app.py
```
|
|
|
## Citation |
|
```bibtex |
|
@misc{pernias2023wuerstchen,
      title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
      author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
      year={2023},
      eprint={2306.00637},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
|
``` |
|
|
|
## LICENSE |
|
All the code from this repo is under an [MIT LICENSE](LICENSE).

The model weights, which you can get from Hugging Face by following [these instructions](/models/readme.md), are under the [STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE](WEIGHTS_LICENSE).
|
|