|
--- |
|
license: apache-2.0 |
|
pipeline_tag: image-to-image |
|
--- |
|
|
|
# S3Diff Model Card |
|
This model card focuses on the models associated with S3Diff, available [here](https://github.com/ArcticHare105/S3Diff).
|
|
|
## Model Details |
|
- **Developed by:** Aiping Zhang |
|
- **Model type:** Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors |
|
- **Model Description:** This is the model presented in the paper [Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors](https://arxiv.org/abs/2409.17058).
|
- **Resources for more information:** [GitHub Repository](https://github.com/ArcticHare105/S3Diff). |
|
- **Cite as:** |
|
|
|
    @article{2024s3diff,
      author  = {Aiping Zhang and Zongsheng Yue and Renjing Pei and Wenqi Ren and Xiaochun Cao},
      title   = {Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors},
      journal = {arXiv preprint arXiv:2409.17058},
      year    = {2024},
    }
|
|
|
## Limitations and Bias |
|
|
|
### Limitations |
|
|
|
- S3Diff requires tiled processing to generate high-resolution images, which can substantially increase inference time (see the sketch after this list).
|
- S3Diff sometimes cannot preserve full fidelity to the input due to its generative nature.
|
- S3Diff sometimes cannot generate perfect details in complex real-world scenarios.
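
To illustrate the tiling cost, below is a minimal sketch, not the repository's actual implementation, of running super-resolution over overlapping tiles and averaging the overlaps. The `sr_model` interface, tile size, overlap, and scale factor are assumptions made for illustration only.

```python
import torch


def tiled_sr(lr: torch.Tensor, sr_model, tile: int = 128, overlap: int = 16, scale: int = 4) -> torch.Tensor:
    """Upscale a (1, 3, H, W) tensor tile by tile, averaging overlapping regions.

    `sr_model`, `tile`, `overlap`, and `scale` are illustrative assumptions,
    not the settings used by the S3Diff repository.
    """
    _, _, h, w = lr.shape
    out = torch.zeros(1, 3, h * scale, w * scale, device=lr.device, dtype=lr.dtype)
    weight = torch.zeros_like(out)
    step = tile - overlap
    for top in range(0, h, step):
        for left in range(0, w, step):
            bottom, right = min(top + tile, h), min(left + tile, w)
            sr_patch = sr_model(lr[:, :, top:bottom, left:right])  # one full network call per tile
            out[:, :, top * scale:bottom * scale, left * scale:right * scale] += sr_patch
            weight[:, :, top * scale:bottom * scale, left * scale:right * scale] += 1.0
    return out / weight.clamp(min=1.0)  # every pixel is covered; clamp only guards division by zero
```

With these illustrative settings, a 512 x 512 input already produces a 5 x 5 grid of tiles, i.e. 25 forward passes instead of one, which is where the extra inference time comes from.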
|
|
|
### Bias |
|
Although our model builds on a pre-trained SD-Turbo model, we have not observed obvious bias in its outputs so far. We conjecture that the main reason is that our model is conditioned on low-resolution input images rather than text prompts; such a strong condition makes the outputs less likely to inherit biases from the diffusion prior.
|
|
|
## Training |
|
|
|
**Training Data** |
|
The model developers used the following data to train the model:
|
|
|
- Our model is finetuned on the [LSDIR](https://data.vision.ee.ethz.ch/yawli/index.html) dataset together with 10K face images from FFHQ.
|
|
|
**Training Procedure** |
|
S3Diff is an image super-resolution model finetuned from [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo), further equipped with a degradation-guided LoRA and online negative prompting.
|
|
|
- Following SD-Turbo, images are encoded through the fixed autoencoder, which turns images into latent representations. The autoencoder uses a downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4 (see the sketch after this list).
|
- The LR images are fed to a degradation estimation network, trained following [MM-RealSR](https://github.com/TencentARC/MM-RealSR), to predict degradation scores.
|
- LoRA layers are injected only into the VAE encoder and the UNet.
|
- The total loss combines an L2 loss, an LPIPS loss, and a GAN loss.
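
To make the shapes concrete, here is a minimal sketch assuming recent versions of the `diffusers` and `peft` libraries (it is not the repository's training code): it encodes a dummy image with the frozen SD-Turbo autoencoder to obtain H/8 x W/8 x 4 latents, and shows one generic way LoRA adapters could be attached to the UNet's attention projections. The LoRA rank and target modules are illustrative assumptions, and the paper's degradation-guided LoRA additionally modulates the adapters with the predicted degradation scores, which is not shown here.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from peft import LoraConfig

# Frozen SD-Turbo autoencoder: maps an H x W x 3 image to an H/8 x W/8 x 4 latent.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-turbo", subfolder="vae")
vae.requires_grad_(False)

image = torch.randn(1, 3, 512, 512)  # dummy tensor standing in for a real LR image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)  # torch.Size([1, 4, 64, 64]), i.e. H/8 x W/8 x 4

# One generic way to attach LoRA adapters to the UNet's attention projections with peft;
# rank and target modules here are illustrative, not the paper's settings, and the
# degradation-guided modulation from the paper is omitted.
unet = UNet2DConditionModel.from_pretrained("stabilityai/sd-turbo", subfolder="unet")
lora_cfg = LoraConfig(r=8, lora_alpha=8, target_modules=["to_q", "to_k", "to_v", "to_out.0"])
unet.add_adapter(lora_cfg)
```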
|
|
|
We currently provide the following checkpoints (a short download sketch follows the list):
|
|
|
- [s3diff.pkl](https://huggingface.co/zhangap/S3Diff/blob/main/s3diff.pkl): S3Diff finetuned on [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) for 30k iterations. |
|
- [de_net.pth](https://huggingface.co/zhangap/S3Diff/blob/main/de_net.pth): The degradation estimation network, extracted from [MM-RealSR](https://github.com/TencentARC/MM-RealSR).
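
For convenience, here is a minimal sketch of fetching these files with the `huggingface_hub` library; the returned values are local cache paths, and loading the weights afterwards is handled by the S3Diff code, not shown here.

```python
from huggingface_hub import hf_hub_download

# Download the S3Diff weights and the degradation estimation network from this repository;
# files are cached locally and their paths returned.
s3diff_path = hf_hub_download(repo_id="zhangap/S3Diff", filename="s3diff.pkl")
de_net_path = hf_hub_download(repo_id="zhangap/S3Diff", filename="de_net.pth")
print(s3diff_path, de_net_path)
```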
|
|
|
## Evaluation Results |
|
See the [paper](https://arxiv.org/abs/2409.17058) for detailed evaluation results.
|
|