BK-SDM-2M Model Card
BK-SDM-{Base-2M, Small-2M, Tiny-2M} are pretrained with 10× more data (2.3M LAION image-text pairs) compared to our previous release.
- Block-removed Knowledge-distilled Stable Diffusion Model (BK-SDM) is an architecturally compressed SDM for efficient text-to-image synthesis.
- The previous BK-SDM-{Base, Small, Tiny} were obtained via distillation pretraining on 0.22M LAION pairs.
- Resources for more information: Paper, GitHub, Demo.
Examples with the 🤗 Diffusers library
The following inference code uses the default PNDM scheduler and 50 denoising steps.
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("nota-ai/bk-sdm-tiny-2m", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a black vase holding a bouquet of roses"
image = pipe(prompt).images[0]
image.save("example.png")
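The evaluation reported in this card uses 25 denoising steps rather than the default 50. The sketch below shows one way the step count and a fixed seed could be passed to the pipeline. The helper name `generate_image`, the seed value, and the output filename are illustrative assumptions and not part of the released code. The imports sit inside the function so that defining it does not load the libraries or the weights.

```python
def generate_image(prompt, num_inference_steps=25, seed=0, device="cuda"):
    """Generate one image with an explicit step count and seed (hypothetical helper)."""
    # Imported lazily so that defining this helper does not load the libraries.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "nota-ai/bk-sdm-tiny-2m", torch_dtype=torch.float16
    ).to(device)

    # A seeded generator makes repeated runs start from the same initial noise.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(prompt, num_inference_steps=num_inference_steps, generator=generator)
    return result.images[0]


# Example usage (requires a CUDA GPU and downloads the model weights):
# image = generate_image("a black vase holding a bouquet of roses")
# image.save("example_25steps.png")
```

Fewer steps shorten generation time; whether image quality at 25 steps is sufficient for a given use case is best checked by visual comparison.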
Compression Method
BK-SDM-2M keeps the U-Net architecture and distillation pretraining of BK-SDM; the difference is a 10× increase in the number of training pairs.
- Training Data: 2,256,472 image-text pairs (i.e., 2.3M pairs) from LAION-Aesthetics V2 6.25+.
- Hardware: A single NVIDIA A100 80GB GPU
- Gradient Accumulation Steps: 4
- Batch Size: 256 (= 4 accumulation steps × 64 samples per step)
- Optimizer: AdamW
- Learning Rate: constant 5e-5 over 50K pretraining iterations
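The training configuration above implies a rough data budget, which the following sketch works out. It assumes that one "iteration" corresponds to one optimizer update on the full batch of 256; the card does not state this explicitly, so the epoch estimate should be read as approximate.

```python
# Values taken from the training configuration listed above.
NUM_PAIRS = 2_256_472      # LAION-Aesthetics V2 6.25+ image-text pairs
GRAD_ACCUM_STEPS = 4
MICRO_BATCH = 64
ITERATIONS = 50_000

# Effective batch size per optimizer update: 4 accumulation steps x 64 samples.
effective_batch = GRAD_ACCUM_STEPS * MICRO_BATCH

# ASSUMPTION: one "iteration" is one optimizer update on the full batch of 256.
samples_seen = ITERATIONS * effective_batch
approx_epochs = samples_seen / NUM_PAIRS

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen:,}")
print(f"approximate passes over the data: {approx_epochs:.2f}")
```

Under that assumption, pretraining covers the 2.3M-pair dataset several times rather than once.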
Experimental Results
The following table shows zero-shot results on 30K samples from the MS-COCO validation split. Images were generated at 512×512 with the PNDM scheduler and 25 denoising steps, then downsampled to 256×256 to compute the generation scores.
- Our models are the checkpoints taken at the 50K-th training iteration.
Model | FID↓ | IS↑ | CLIP Score↑ (ViT-g/14) | # Params, U-Net | # Params, Whole SDM |
---|---|---|---|---|---|
Stable Diffusion v1.4 | 13.05 | 36.76 | 0.2958 | 0.86B | 1.04B |
BK-SDM-Base (Ours) | 15.76 | 33.79 | 0.2878 | 0.58B | 0.76B |
BK-SDM-Base-2M (Ours) | 14.81 | 34.17 | 0.2883 | 0.58B | 0.76B |
BK-SDM-Small (Ours) | 16.98 | 31.68 | 0.2677 | 0.49B | 0.66B |
BK-SDM-Small-2M (Ours) | 17.05 | 33.10 | 0.2734 | 0.49B | 0.66B |
BK-SDM-Tiny (Ours) | 17.12 | 30.09 | 0.2653 | 0.33B | 0.50B |
BK-SDM-Tiny-2M (Ours) | 17.53 | 31.32 | 0.2690 | 0.33B | 0.50B |
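The parameter columns of the table can be turned into relative reductions with respect to Stable Diffusion v1.4. The sketch below does this from the rounded counts shown in the table, so the percentages are approximate; the function and variable names are illustrative.

```python
# Parameter counts in billions, copied from the table above (rounded values).
SD_V14 = {"unet": 0.86, "whole": 1.04}
COMPRESSED = {
    "BK-SDM-Base": {"unet": 0.58, "whole": 0.76},
    "BK-SDM-Small": {"unet": 0.49, "whole": 0.66},
    "BK-SDM-Tiny": {"unet": 0.33, "whole": 0.50},
}


def reduction_percent(original, compressed):
    """Return the relative parameter reduction in percent."""
    return 100.0 * (1.0 - compressed / original)


reductions = {
    name: {part: reduction_percent(SD_V14[part], count) for part, count in parts.items()}
    for name, parts in COMPRESSED.items()
}

for name, parts in reductions.items():
    print(f"{name}: U-Net -{parts['unet']:.1f}%, whole SDM -{parts['whole']:.1f}%")
```

The 2M variants share the architectures of their 0.22M counterparts, so the same reductions apply to them.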
Effect of Different Data Sizes for Training BK-SDM-Small
Increasing the number of training pairs improves the IS and CLIP scores as training progresses. The MS-COCO 256×256 30K benchmark was used for evaluation.
Furthermore, with the growth in data volume, visual results become more favorable (e.g., better image-text alignment and clear distinction among objects).
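The quantitative side of the data-size comparison can also be read from the results table in the Experimental Results section. The sketch below computes the change in each metric when moving from the 0.22M-pair models to the 2.3M-pair models; the data structure and names are illustrative, and the numbers are copied from that table.

```python
# (FID, IS, CLIP score) copied from the results table.
RESULTS = {
    "Base": {"0.22M": (15.76, 33.79, 0.2878), "2.3M": (14.81, 34.17, 0.2883)},
    "Small": {"0.22M": (16.98, 31.68, 0.2677), "2.3M": (17.05, 33.10, 0.2734)},
    "Tiny": {"0.22M": (17.12, 30.09, 0.2653), "2.3M": (17.53, 31.32, 0.2690)},
}

# Difference (2.3M minus 0.22M) per metric; lower FID is better, higher IS/CLIP is better.
deltas = {}
for size, runs in RESULTS.items():
    old, new = runs["0.22M"], runs["2.3M"]
    deltas[size] = tuple(round(n - o, 4) for o, n in zip(old, new))

for size, (d_fid, d_is, d_clip) in deltas.items():
    print(f"{size}: FID {d_fid:+.2f}, IS {d_is:+.2f}, CLIP {d_clip:+.4f}")
```

In these table entries, IS and CLIP score increase for all three model sizes, while FID improves only for the Base model.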
Additional Visual Examples
Uses
Follow the usage guidelines of Stable Diffusion v1.
Acknowledgments
- We express our gratitude to Microsoft for Startups Founders Hub for generously providing the Azure credits used during pretraining.
- We deeply appreciate the pioneering research on Latent/Stable Diffusion conducted by CompVis, Runway, and Stability AI.
- Special thanks to the contributors to LAION, Diffusers, and Gradio for their valuable support.
Citation
@article{kim2023architectural,
title={On Architectural Compression of Text-to-Image Diffusion Models},
author={Kim, Bo-Kyeong and Song, Hyoung-Kyu and Castells, Thibault and Choi, Shinkook},
journal={arXiv preprint arXiv:2305.15798},
year={2023},
url={https://arxiv.org/abs/2305.15798}
}
@article{Kim_2023_ICMLW,
title={BK-SDM: Architecturally Compressed Stable Diffusion for Efficient Text-to-Image Generation},
author={Kim, Bo-Kyeong and Song, Hyoung-Kyu and Castells, Thibault and Choi, Shinkook},
journal={ICML Workshop on Efficient Systems for Foundation Models (ES-FoMo)},
year={2023},
url={https://openreview.net/forum?id=bOVydU0XKC}
}
This model card was written by Bo-Kyeong Kim and is based on the Stable Diffusion v1 model card.