metadata

license: other
pipeline_tag: image-to-image

StableSR Model Card

This model card describes the models associated with StableSR, available here.

Model Details

Developed by: Jianyi Wang
Model type: Diffusion-based image super-resolution model
License: S-Lab License 1.0
Model Description: This is the model used in the paper "Exploiting Diffusion Prior for Real-World Image Super-Resolution" (see the citation below).
Resources for more information: GitHub Repository.

Cite as:

@InProceedings{wang2023exploiting,
    author    = {Wang, Jianyi and Yue, Zongsheng and Zhou, Shangchen and Chan, Kelvin CK and Loy, Chen Change},
    title     = {Exploiting Diffusion Prior for Real-World Image Super-Resolution},
    booktitle = {arXiv preprint arXiv:2305.07015},
    year      = {2023},
}

Uses

Please refer to the S-Lab License 1.0.

Limitations and Bias

Limitations

StableSR still requires multiple sampling steps to generate an image, which makes it much slower than GAN-based approaches, especially for large images beyond a resolution of 512 or 768.
StableSR cannot always fully preserve fidelity to the input, due to its generative nature.
StableSR cannot always generate perfect details in complex real-world scenarios.

Bias

Although our model is based on a pre-trained Stable Diffusion model, we have not so far observed obvious bias in the generated results. We conjecture that the main reason is that our model is conditioned on low-resolution images rather than on text prompts. Such strong conditioning makes the model less likely to be affected.

Training

Training Data

The model developer used the following datasets to train the model:

Our diffusion model is finetuned on the DF2K (DIV2K and Flickr2K) and OST datasets, available here.
We further generate 100k synthetic LR-HR pairs on DF2K_OST using the finetuned diffusion model for training the CFW module.

Training Procedure

StableSR is an image super-resolution model finetuned from Stable Diffusion and further equipped with a time-aware encoder and a controllable feature wrapping (CFW) module.

Following Stable Diffusion, images are encoded by the fixed autoencoder, which turns them into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
The latent representations are fed to the time-aware encoder as guidance.
The loss is the same as that of Stable Diffusion.
After finetuning the diffusion model, we further train the CFW module using the data generated by the finetuned diffusion model.
The autoencoder model is fixed and only CFW is trainable.
The loss is similar to training an autoencoder, except that we use a fixed adversarial loss weight of 0.025 rather than a self-adjustable one.
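The shape arithmetic and the fixed adversarial weight described above can be sketched as follows. This is a minimal illustration, not the authors' training code: `latent_shape` and `cfw_generator_loss` are hypothetical helper names, the divisibility check is an assumption, and the CFW loss is reduced to a single reconstruction term plus a weighted adversarial term purely for illustration.

```python
from typing import Tuple

# Values stated in the model card.
DOWNSAMPLING_FACTOR = 8      # f
LATENT_CHANNELS = 4
ADV_LOSS_WEIGHT = 0.025      # fixed, rather than self-adjustable


def latent_shape(height: int, width: int,
                 f: int = DOWNSAMPLING_FACTOR) -> Tuple[int, int, int]:
    """Map an image of shape H x W x 3 to its latent shape H/f x W/f x 4."""
    # Assumption: inputs are expected to be divisible by f.
    if height % f != 0 or width % f != 0:
        raise ValueError(f"height and width must be multiples of {f}")
    return (height // f, width // f, LATENT_CHANNELS)


def cfw_generator_loss(reconstruction_loss: float,
                       adversarial_loss: float,
                       adv_weight: float = ADV_LOSS_WEIGHT) -> float:
    """Combine loss terms using a constant adversarial weight.

    Simplified sketch: the real autoencoder-style objective may contain
    further terms (for example a perceptual loss) that are omitted here.
    """
    return reconstruction_loss + adv_weight * adversarial_loss


print(latent_shape(512, 512))  # (64, 64, 4)
print(latent_shape(768, 768))  # (96, 96, 4)
```

With a constant weight, the adversarial term's contribution does not change during training, whereas a self-adjustable weight would be recomputed as training progresses.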

We currently provide the following checkpoints:

stablesr_000117.ckpt: Diffusion model finetuned from SD2.1-512base on the DF2K_OST dataset for 117 epochs.
vqgan_cfw_00011.ckpt: CFW module with a fixed autoencoder, trained on synthetic paired data for 11 epochs.
stablesr_768v_000139.ckpt: Diffusion model finetuned from SD2.1-768v on the DF2K_OST dataset for 139 epochs.
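As a convenience, the checkpoint list above could be captured in a small lookup table. The sketch below only restates the file names, base models, and epoch counts given in this card; the dictionary layout and the `diffusion_checkpoint_for` helper are hypothetical and are not part of the StableSR codebase.

```python
from typing import Dict, Optional, TypedDict


class CheckpointInfo(TypedDict):
    role: str             # "diffusion" or "cfw"
    base: Optional[str]   # Stable Diffusion base model, if applicable
    epochs: int


# Entries copied from the checkpoint list in this model card.
CHECKPOINTS: Dict[str, CheckpointInfo] = {
    "stablesr_000117.ckpt": {
        "role": "diffusion", "base": "SD2.1-512base", "epochs": 117},
    "vqgan_cfw_00011.ckpt": {
        "role": "cfw", "base": None, "epochs": 11},
    "stablesr_768v_000139.ckpt": {
        "role": "diffusion", "base": "SD2.1-768v", "epochs": 139},
}


def diffusion_checkpoint_for(base: str) -> str:
    """Return the diffusion checkpoint finetuned from the given base model."""
    for name, info in CHECKPOINTS.items():
        if info["role"] == "diffusion" and info["base"] == base:
            return name
    raise KeyError(f"no diffusion checkpoint listed for base model {base!r}")


print(diffusion_checkpoint_for("SD2.1-768v"))  # stablesr_768v_000139.ckpt
```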

Evaluation Results

See Paper for details.