dome272 committed
Commit 99fc217
1 Parent(s): a844c76

Update README.md

Files changed (1)
  1. README.md +36 -4
README.md CHANGED
@@ -5,12 +5,21 @@ license_name: stable-cascade-nc-community
  license_link: LICENSE
  ---

- # Stable Cascade Text-to-Image Model Card

  <!-- Provide a quick summary of what the model is/does. -->
- ![image]()
-
- Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.

  ## Model Details

@@ -29,8 +38,31 @@ For research purposes, we recommend our `StableCascade` Github repository (https
  - **Repository:** https://github.com/Stability-AI/StableCascade
  - **Paper:** https://openreview.net/forum?id=gU58d5QeGv

  ## Evaluation

  ## Uses

  license_link: LICENSE
  ---

+ # Stable Cascade Model Card

  <!-- Provide a quick summary of what the model is/does. -->
+ <img src="figures/collage_1.jpg" width="800">
+
+ This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture, and its main
+ difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this
+ important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
+ How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is
+ encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that a 1024x1024 image can be
+ encoded to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this
+ highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
+ Diffusion 1.5. <br> <br>
+ This kind of model is therefore well suited for use cases where efficiency is important. Furthermore, all known
+ extensions such as finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc. are also possible with this method.

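To make the compression figures above concrete, here is a small back-of-the-envelope sketch (illustrative only; it ignores latent channel counts and uses the approximate compression factors quoted in the paragraph above):

```python
# Rough spatial latent sizes implied by the quoted compression factors:
# Stable Diffusion uses a factor of 8, Stable Cascade roughly 42.
# Channel counts are ignored, so this is only a ballpark comparison.

def latent_side(image_side: int, compression_factor: int) -> int:
    """Spatial side length of the latent grid for a square image."""
    return image_side // compression_factor

image_side = 1024

sd_side = latent_side(image_side, 8)        # 128 -> 128 x 128 latent grid
cascade_side = latent_side(image_side, 42)  # 24  -> 24 x 24 latent grid

print(f"Stable Diffusion latent: {sd_side} x {sd_side}")
print(f"Stable Cascade latent:   {cascade_side} x {cascade_side}")

# How many fewer spatial positions the diffusion model has to denoise:
print(f"~{sd_side**2 / cascade_side**2:.0f}x fewer positions than the SD latent")
print(f"~{image_side**2 / cascade_side**2:.0f}x fewer positions than pixel space")
```
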
  ## Model Details

  - **Repository:** https://github.com/Stability-AI/StableCascade
  - **Paper:** https://openreview.net/forum?id=gU58d5QeGv

+ ### Model Overview
+ Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade for generating images,
+ hence the name "Stable Cascade".
+ Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
+ However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
+ spatial compression factor of 8, encoding an image with a resolution of 1024 x 1024 to 128 x 128, Stable Cascade
+ achieves a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while still being able to accurately
+ decode the image. This comes with the great benefit of cheaper training and inference. Stage C, in turn, is responsible
+ for generating the small 24 x 24 latents given a text prompt. The figure below shows this visually.
+
+ <img src="figures/model-overview.jpg" width="600">
+
+ For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
+ a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most of the
+ work went into its finetuning. The two versions of Stage B have 700 million and 1.5 billion parameters respectively.
+ Both achieve great results; however, the 1.5 billion version excels at reconstructing small and fine details. You will
+ therefore achieve the best results by using the larger variant of each stage. Lastly, Stage A contains 20 million
+ parameters and is fixed due to its small size.
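
The cascade described above maps to a two-stage inference flow: Stage C (the prior) turns the prompt into the small 24 x 24 latents, and Stages B and A then decode those latents back into a full-resolution image. The sketch below is a minimal, non-authoritative example assuming the `diffusers` integration (`StableCascadePriorPipeline` / `StableCascadeDecoderPipeline`) and the `stabilityai/stable-cascade-prior` / `stabilityai/stable-cascade` checkpoints; pipeline names, dtypes and default settings may differ from the official usage instructions.

```python
# Minimal sketch of two-stage Stable Cascade inference via diffusers.
# Assumes a diffusers version that ships the Stable Cascade pipelines.
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "an astronaut riding a horse, detailed, cinematic lighting"
negative_prompt = ""

# Stage C: generates the highly compressed 24 x 24 latents from the prompt.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
)
prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    guidance_scale=4.0,
    num_inference_steps=20,
)

# Stages B + A: decode the Stage C latents back into a 1024 x 1024 image.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
)
decoder.enable_model_cpu_offload()
image = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    num_inference_steps=10,
).images[0]
image.save("stable_cascade_example.png")
```

The two calls mirror the cascade: the prior's `image_embeddings` output is the compressed latent representation that Stages B and A subsequently decode.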

  ## Evaluation
+ <img height="300" src="figures/comparison.png"/>
+ According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
+ comparisons. The figure above shows the results of a human evaluation using a mix of parti-prompts (link) and
+ aesthetic prompts. Specifically, the comparison was conducted against Playground v2, SDXL Turbo, SDXL and Würstchen v2.
+

  ## Uses