|
# Stable Cascade |
|
<p align="center"> |
|
<img src="figures/collage_1.jpg" width="800"> |
|
</p> |
|
|
|
This is the official codebase for **Stable Cascade**. We provide training & inference scripts, as well as a variety of models you can use.
|
<br><br> |
|
This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture, and its main difference from other models, like Stable Diffusion, is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a 1024x1024 image to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable Diffusion 1.5. <br> <br>
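To put rough numbers on this claim, a quick back-of-the-envelope comparison (plain arithmetic, no model code):

```python
# Latent grid sizes for a 1024x1024 image under each compression factor.
sd_side = 1024 // 8        # Stable Diffusion, factor 8 -> 128
cascade_side = 1024 // 42  # Stable Cascade, factor ~42 -> 24

print(sd_side ** 2)        # 16384 spatial positions
print(cascade_side ** 2)   # 576 spatial positions, roughly 28x fewer
```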
|
Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions, such as finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc., are possible with this method as well. A few of these (finetuning, ControlNet, LoRA) are already provided in the [training](train) and [inference](inference) sections.
|
|
|
Moreover, Stable Cascade achieves impressive results, both visually and in evaluations. According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all comparisons. The figure below shows the results from a human evaluation using a mix of parti-prompts (link) and aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
|
<br> |
|
<p align="center"> |
|
<img height="300" src="figures/comparison.png"/> |
|
</p> |
|
|
|
Stable Cascade's focus on efficiency is evidenced through its architecture and its highly compressed latent space. Despite the largest model containing 1.4 billion parameters more than Stable Diffusion XL, it still features faster inference times, as can be seen in the figure below.
|
|
|
<p align="center"> |
|
<img height="300" src="figures/comparison-inference-speed.jpg"/> |
|
</p> |
|
|
|
<hr> |
|
<p align="center"> |
|
<img src="figures/collage_2.jpg" width="800"> |
|
</p> |
|
|
|
## Model Overview |
|
Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade for generating images, hence the name "Stable Cascade". Stage A & B are used to compress images, similar to the role of the VAE in Stable Diffusion. However, as mentioned before, this setup achieves a much higher compression of images. Stage C, in turn, is responsible for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually. Note that Stage A is a VAE and both Stage B & C are diffusion models.
|
|
|
<p align="center"> |
|
<img src="figures/model-overview.jpg" width="600"> |
|
</p> |
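To make the data flow concrete, here is a shape-level sketch of the cascade using dummy tensors only; the intermediate channel counts and resolution are illustrative, not exact specifications of the released checkpoints:

```python
import torch

# Stage C (diffusion model): generates the tiny text-conditional latents.
stage_c_latents = torch.randn(1, 16, 24, 24)

# Stage B (diffusion model): expands those latents back toward pixel space,
# conditioned on Stage C's output.
stage_b_latents = torch.randn(1, 4, 256, 256)

# Stage A (VAE): decodes Stage B's output into the final image.
image = torch.randn(1, 3, 1024, 1024)
```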
|
|
|
For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes in a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was put into its finetuning. The two versions of Stage B amount to 700 million and 1.5 billion parameters. Both achieve great results; however, the 1.5 billion parameter version excels at reconstructing small and fine details. Therefore, you will achieve the best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to its small size.
|
|
|
## Getting Started |
|
This section will briefly outline how you can get started with **Stable Cascade**. |
|
|
|
### Inference |
|
Running the model can be done through the notebooks provided in the [inference](inference) section. You will find more details regarding downloading the models, compute requirements, as well as some tutorials on how to use the models. Specifically, there are four notebooks provided for the following use cases:
|
#### Text-to-Image |
|
A compact [notebook](inference/text_to_image.ipynb) that provides you with basic functionality for text-to-image, image-variation and image-to-image.
|
- Text-to-Image |
|
|
|
`Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee.` |
|
<p align="center"> |
|
<img src="figures/text-to-image-example-penguin.jpg" width="800"> |
|
</p> |
|
|
|
- Image Variation |
|
|
|
The model can also understand image embeddings, which makes it possible to generate variations of a given image (left). No prompt was given here.
|
<p align="center"> |
|
<img src="figures/image-variations-example-headset.jpg" width="800"> |
|
</p> |
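Under the hood, this relies on image embeddings from a CLIP vision encoder. A minimal sketch of computing such an embedding is below; the checkpoint name is a placeholder, and the notebook shows which encoder Stable Cascade actually conditions on:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Compute a CLIP image embedding to condition Stage C on instead of text.
# The checkpoint below is an assumption for illustration purposes only.
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("headset.jpg")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    image_embeds = encoder(pixel_values=pixel_values).image_embeds  # (1, 768)
```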
|
|
|
- Image-to-Image |
|
|
|
This works as usual: an image is noised up to a specific point, and the model then generates from that starting point. Here, the left image is noised to 80% and the caption is: `A person riding a rodent.`
|
<p align="center"> |
|
<img src="figures/image-to-image-example-rodent.jpg" width="800"> |
|
</p> |
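Conceptually, the "noised to 80%" step looks like the following self-contained sketch; the tensor is a stand-in for encoded latents and the blend is a simplified schedule, with the repo's sampler handling the real noising:

```python
import torch

def noise_to_strength(latents: torch.Tensor, strength: float) -> torch.Tensor:
    """Blend clean latents with Gaussian noise; strength=1.0 is pure noise."""
    noise = torch.randn_like(latents)
    return (1.0 - strength) * latents + strength * noise

# Stand-in for the encoded input image; real latents come from the encoder.
latents = torch.randn(1, 16, 24, 24)
start = noise_to_strength(latents, strength=0.8)  # "noised to 80%"
# Sampling then resumes from the timestep corresponding to strength=0.8.
```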
|
|
|
Furthermore, the model is also accessible in the diffusers 🤗 library. You can find the documentation and usage [here](https://huggingface.co/stabilityai/stable-cascade). |
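A minimal sketch of the two-pipeline diffusers flow is shown below; pipeline names and settings follow the diffusers integration, but double-check them against the model card linked above, as the integration may still be in flux:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# Stage C (prior) generates image embeddings from text; Stage B/A (decoder)
# turn those embeddings into the final image.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")

prompt = "Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee."
prior_out = prior(prompt=prompt, num_inference_steps=20, guidance_scale=4.0)
images = decoder(
    image_embeddings=prior_out.image_embeddings.to(torch.float16),
    prompt=prompt,
    num_inference_steps=10,
    guidance_scale=0.0,
).images
images[0].save("penguin.png")
```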
|
#### ControlNet |
|
This [notebook](inference/controlnet.ipynb) shows how to use ControlNets that were trained by us or how to use one that you trained yourself for Stable Cascade. With this release, we provide the following ControlNets:
|
- Inpainting / Outpainting |
|
|
|
<p align="center"> |
|
<img src="figures/controlnet-paint.jpg" width="800"> |
|
</p> |
|
|
|
- Face Identity |
|
|
|
<p align="center"> |
|
<img src="figures/controlnet-face.jpg" width="800"> |
|
</p> |
|
|
|
**Note**: The Face Identity ControlNet will be released at a later point. |
|
|
|
- Canny |
|
|
|
<p align="center"> |
|
<img src="figures/controlnet-canny.jpg" width="800"> |
|
</p> |
|
|
|
- Super Resolution |
|
<p align="center"> |
|
<img src="figures/controlnet-sr.jpg" width="800"> |
|
</p> |
|
|
|
These can all be used through the same notebook and only require changing the config for each ControlNet. More information is provided in the [inference guide](inference).
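For illustration, switching ControlNets boils down to loading a different YAML config; the path below is a hypothetical example, and the actual file names live under the configs folder of this repo:

```python
import yaml

# Hypothetical config path for illustration; see configs/ in this repo for
# the actual file names used by the notebook.
with open("configs/inference/controlnet_c_3b_canny.yaml") as f:
    config = yaml.safe_load(f)
print(config.keys())  # e.g. checkpoint paths, ControlNet type, sampling settings
```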
|
#### LoRA |
|
We also provide our own implementation for training and using LoRAs with Stable Cascade, which can be used to finetune the text-conditional model (Stage C). Specifically, you can add and learn new tokens and add LoRA layers to the model. This [notebook](inference/lora.ipynb) shows how you can use a trained LoRA.
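For intuition, this is the basic mechanism of a LoRA layer, as a generic sketch rather than the repo's exact implementation: the frozen weight is adapted by a trainable low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer whose frozen weight is adapted by a low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the original weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # start as no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```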
|
For example, after training a LoRA on my dog with the following kind of training images:
|
<p align="center"> |
|
<img src="figures/fernando_original.jpg" width="800"> |
|
</p> |
|
|
|
I can generate the following images of my dog given the prompt:
|
`Cinematic photo of a dog [fernando] wearing a space suit.` |
|
<p align="center"> |
|
<img src="figures/fernando.jpg" width="800"> |
|
</p> |
|
|
|
#### Image Reconstruction |
|
Lastly, one thing that might be very interesting, especially if you want to train your own text-conditional model from scratch, maybe even with a completely different architecture than our Stage C, is to use the (Diffusion) Autoencoder that Stable Cascade uses in order to work in the highly compressed space. Just like people use Stable Diffusion's VAE to train their own models (e.g. DALL·E 3), you could use Stage A & B in the same way, while benefiting from a much higher compression, allowing you to train and run models faster. <br>
The notebook shows how to encode and decode images and what specific benefits you get.
|
For example, say you have the following batch of images of dimension `4 x 3 x 1024 x 1024`: |
|
<p align="center"> |
|
<img src="figures/original.jpg" width="800"> |
|
</p> |
|
|
|
You can encode these images to a compressed size of `4 x 16 x 24 x 24`, giving you a spatial compression factor of `1024 / 24 = 42.67`. Afterwards you can use Stage A & B to decode the images back to `4 x 3 x 1024 x 1024`, giving you the following output:
|
<p align="center"> |
|
<img src="figures/reconstructed.jpg" width="800"> |
|
</p> |
|
|
|
As you can see, the reconstructions are surprisingly close, even for small details. Such reconstructions are not possible with a standard VAE at this level of compression. The [notebook](inference/reconstruct_images.ipynb) gives you more information and easy code to try it out.
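If you just want to sanity-check the shapes before opening the notebook, here is a stand-in sketch; the tensors are dummies, and the real encode/decode calls for Stage A & B live in the notebook:

```python
import torch

images = torch.randn(4, 3, 1024, 1024)   # input batch, as in the example above
latents = torch.randn(4, 16, 24, 24)     # what encoding produces: 16ch at 24x24

spatial_compression = 1024 / 24                 # ~42.67, as computed above
size_ratio = images.numel() / latents.numel()   # ~341x fewer values overall
```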
|
|
|
### Training |
|
We provide code for training Stable Cascade from scratch, finetuning, ControlNet and LoRA. You can find a comprehensive explanation for how to do so in the [training folder](train).
|
|
|
## Remarks |
|
The codebase is in early development. You might encounter unexpected errors, or training and inference code that is not yet perfectly optimized. We apologize for that in advance. If there is interest, we will continue releasing updates, aiming to bring in the latest improvements and optimizations. Moreover, we would be more than happy to receive ideas, feedback or even updates from people that would like to contribute. Cheers.
|
|
|
## Gradio App |
|
First, install gradio and diffusers by running:
```bash
pip3 install gradio
pip3 install accelerate  # optional
pip3 install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
```
|
Then from the root of the project run this command: |
|
```bash
PYTHONPATH=./ python3 gradio_app/app.py
```
|
|
|
## Citation |
|
```bibtex |
|
@misc{pernias2023wuerstchen,
      title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
      author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
      year={2023},
      eprint={2306.00637},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
|
``` |
|
|
|
## LICENSE |
|
All the code from this repo is under an [MIT LICENSE](LICENSE).

The model weights, which you can get from Hugging Face by following [these instructions](/models/readme.md), are under the [STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE](WEIGHTS_LICENSE).
|
|