dome272 committed
Commit 99fc217
1 Parent(s): a844c76

Update README.md

Files changed (1)
  1. README.md +36 -4
README.md CHANGED
@@ -5,12 +5,21 @@ license_name: stable-cascade-nc-community
  license_link: LICENSE
  ---

- # Stable Cascade Text-to-Image Model Card

  <!-- Provide a quick summary of what the model is/does. -->
- ![image]()
-
- Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.

  ## Model Details

@@ -29,8 +38,31 @@ For research purposes, we recommend our `StableCascade` Github repository (https
  - **Repository:** https://github.com/Stability-AI/StableCascade
  - **Paper:** https://openreview.net/forum?id=gU58d5QeGv

  ## Evaluation

  ## Uses

  license_link: LICENSE
  ---

+ # Stable Cascade Model Card

  <!-- Provide a quick summary of what the model is/does. -->
+ <img src="figures/collage_1.jpg" width="800">
+
+ This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture, and its main
+ difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this
+ important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
+ How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is
+ encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that a 1024x1024 image can be
+ encoded to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this
+ highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
+ Diffusion 1.5. <br> <br>
+ This kind of model is therefore well suited for use cases where efficiency is important. Furthermore, all known
+ extensions such as finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc. are also possible with this method.

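To make the compression figures above concrete, here is a small back-of-the-envelope sketch (illustrative only; it ignores latent channel counts and uses the approximate compression factors quoted in the paragraph above):

```python
# Rough spatial latent sizes implied by the quoted compression factors:
# Stable Diffusion uses a factor of 8, Stable Cascade roughly 42.
# Channel counts are ignored, so this is only a ballpark comparison.

def latent_side(image_side: int, compression_factor: int) -> int:
    """Spatial side length of the latent grid for a square image."""
    return image_side // compression_factor

image_side = 1024

sd_side = latent_side(image_side, 8)        # 128 -> 128 x 128 latent grid
cascade_side = latent_side(image_side, 42)  # 24  -> 24 x 24 latent grid

print(f"Stable Diffusion latent: {sd_side} x {sd_side}")
print(f"Stable Cascade latent:   {cascade_side} x {cascade_side}")

# How many fewer spatial positions the diffusion model has to denoise:
print(f"~{sd_side**2 / cascade_side**2:.0f}x fewer positions than the SD latent")
print(f"~{image_side**2 / cascade_side**2:.0f}x fewer positions than pixel space")
```
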
  ## Model Details

  - **Repository:** https://github.com/Stability-AI/StableCascade
  - **Paper:** https://openreview.net/forum?id=gU58d5QeGv

+ ### Model Overview
+ Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade for generating images,
+ hence the name "Stable Cascade".
+ Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
+ However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
+ spatial compression factor of 8, encoding an image with a resolution of 1024 x 1024 to 128 x 128, Stable Cascade
+ achieves a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while still being able to accurately
+ decode the image. This comes with the great benefit of cheaper training and inference. Stage C, in turn, is responsible
+ for generating the small 24 x 24 latents given a text prompt. The figure below shows this visually.
+
+ <img src="figures/model-overview.jpg" width="600">
+
+ For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
+ a 1 billion and a 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most of the
+ work went into its finetuning. The two versions of Stage B have 700 million and 1.5 billion parameters respectively.
+ Both achieve great results; however, the 1.5 billion version excels at reconstructing small and fine details. You will
+ therefore achieve the best results by using the larger variant of each stage. Lastly, Stage A contains 20 million
+ parameters and is fixed due to its small size.
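
The cascade described above maps to a two-stage inference flow: Stage C (the prior) turns the prompt into the small 24 x 24 latents, and Stages B and A then decode those latents back into a full-resolution image. The sketch below is a minimal, non-authoritative example assuming the `diffusers` integration (`StableCascadePriorPipeline` / `StableCascadeDecoderPipeline`) and the `stabilityai/stable-cascade-prior` / `stabilityai/stable-cascade` checkpoints; pipeline names, dtypes and default settings may differ from the official usage instructions.

```python
# Minimal sketch of two-stage Stable Cascade inference via diffusers.
# Assumes a diffusers version that ships the Stable Cascade pipelines.
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "an astronaut riding a horse, detailed, cinematic lighting"
negative_prompt = ""

# Stage C: generates the highly compressed 24 x 24 latents from the prompt.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
)
prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    guidance_scale=4.0,
    num_inference_steps=20,
)

# Stages B + A: decode the Stage C latents back into a 1024 x 1024 image.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
)
decoder.enable_model_cpu_offload()
image = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    num_inference_steps=10,
).images[0]
image.save("stable_cascade_example.png")
```

The two calls mirror the cascade: the prior's `image_embeddings` output is the compressed latent representation that Stages B and A subsequently decode.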

  ## Evaluation
+ <img height="300" src="figures/comparison.png"/>
+ According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
+ comparisons. The figure above shows the results of a human evaluation using a mix of parti-prompts (link) and
+ aesthetic prompts. Specifically, the comparison was conducted against Playground v2, SDXL Turbo, SDXL and Würstchen v2.
+

  ## Uses