Furception v1.0, by Project RedRocket.

This is a VAE decoder finetune, resumed from stabilityai/sd-vae-ft-mse and trained on images from e621. It is trained with a mixture of MAE and MSE loss to maintain an acceptable balance between sharpness and smoothness. Loss is calculated in Oklab color space so that reconstruction error is weighted toward the color channels that are more perceptually significant.
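The loss described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the training code. The function names and the `mae_weight` and `mse_weight` parameters are hypothetical, the default weights are placeholders rather than the values used for Furception, and the conversion constants are the ones published in Björn Ottosson's Oklab post. It works on lists of sRGB pixels in [0, 1]; real training would operate on batched tensors and add the LPIPS and discriminator terms.

```python
import math


def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def signed_cbrt(v):
    """Cube root that also accepts small negative values."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def srgb_to_oklab(rgb):
    """Convert one (r, g, b) sRGB pixel in [0, 1] to Oklab (L, a, b)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    # Linear sRGB -> approximate cone responses (constants from the Oklab post).
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    l_, m_, s_ = signed_cbrt(l), signed_cbrt(m), signed_cbrt(s)
    return (
        0.2104542553 * l_ + 0.7936177850 * m_ - 0.0040720468 * s_,
        1.9779984951 * l_ - 2.4285922050 * m_ + 0.4505937099 * s_,
        0.0259040371 * l_ + 0.7827717662 * m_ - 0.8086757660 * s_,
    )


def mixed_oklab_loss(pred, target, mae_weight=1.0, mse_weight=1.0):
    """Weighted sum of MAE and MSE between two pixel lists, measured in Oklab.

    mae_weight and mse_weight are placeholder values (assumption).
    """
    abs_sum = 0.0
    sq_sum = 0.0
    count = 0
    for p, t in zip(pred, target):
        for pc, tc in zip(srgb_to_oklab(p), srgb_to_oklab(t)):
            d = pc - tc
            abs_sum += abs(d)
            sq_sum += d * d
            count += 1
    return mae_weight * abs_sum / count + mse_weight * sq_sum / count
```

The MAE term penalizes small errors relatively more than MSE does, which tends to favor sharper output, while the MSE term penalizes large errors more heavily; mixing them is one way to trade off between the two.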

Our testing has shown that the VAE is good at eliminating unwanted high-frequency noise when used on models trained on similar data. Results are far more apparent on flat-colored images than on realistic or painterly images, but we have not noticed any obvious loss of performance on any type of image. The effects are also more noticeable on lower-resolution generated images, but there are improvements at all resolutions. The VAE may generalize to a broader range of art styles because the dataset covers a variety of styles.

[Comparison images: Default VAE (kl-f8) vs. Furception v1.0]

Note that the Furception output is smoother overall and has significantly less artifacting around edges in high-detail regions.

Licensing:

This VAE is available under the terms of the CC BY-NC-SA 4.0 Deed. This applies to the use of the model, deployment, and distribution of the model weights only. The license does not apply to images decoded by this VAE, and you may release them under any license, even public domain, as long as you are not creating them for commercial purposes. You are free and encouraged to distribute this VAE with models as long as you give credit and the VAE carries this license (the rest of the model does not need to share this license, although its distribution must be non-commercial). We also ask that you include the version number so people can tell whether they need to get an updated version in the future.

Training details:

Overall training is fundamentally similar to LDM. We used the same relative base weights for MAE, MSE, and LPIPS as LDM and, in the case of LPIPS, sd-vae-ft-mse. As in LDM, the discriminator's weight in the loss objective is set dynamically so that the gradient norm for the discriminator is half that of the reconstruction loss. The discriminator is similar to the one LDM uses, except that it is reparameterized to Wasserstein loss with a gradient penalty and its group norm layers are replaced with layer norms.
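The dynamic discriminator weighting described above can be illustrated with a small sketch. It assumes the gradients of the reconstruction loss and of the adversarial loss with respect to the same parameters (in LDM, the decoder's last layer) are already available as flat lists of floats. The function names, the `eps` and `max_weight` safeguards, and the example gradients are illustrative assumptions, not values taken from the Furception training code.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grad))


def adaptive_disc_weight(rec_grad, adv_grad, ratio=0.5, eps=1e-4, max_weight=1e4):
    """Scale factor for the adversarial loss term.

    After scaling, the adversarial gradient norm is about `ratio` times the
    reconstruction gradient norm. eps and max_weight are assumed safeguards
    against division by zero and runaway weights.
    """
    weight = l2_norm(rec_grad) / (l2_norm(adv_grad) + eps)
    weight = min(max(weight, 0.0), max_weight)
    return ratio * weight


# Hypothetical gradients: reconstruction norm 2.0, adversarial norm 4.0.
rec_grad = [2.0, 0.0]
adv_grad = [0.0, 4.0]
w = adaptive_disc_weight(rec_grad, adv_grad)
scaled_adv_norm = w * l2_norm(adv_grad)  # roughly half of l2_norm(rec_grad)
```

Because the weight is recomputed at every step, the adversarial term cannot overwhelm the reconstruction term even if the two losses drift to very different scales during training.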

Training for version 1.0 used random square crops at various downscale levels (Lanczos with antialiasing), randomly rotated and flipped. Training ran for 150,000 steps at a batch size of 32. EMA weights were accumulated using a decay similar to that of sd-vae-ft-mse, scaled for our batch size; the EMA weights are the released version of the model.
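As a rough illustration of scaling an EMA decay for batch size, the sketch below uses one common rule: raise the reference decay to the power of the batch-size ratio, so the average spans about the same number of training samples. This rule is an assumption, since the exact scaling used for Furception is not stated here, and the reference decay and reference batch size in the example are placeholders, not the values used by sd-vae-ft-mse or by this project.

```python
def scale_ema_decay(ref_decay, ref_batch_size, batch_size):
    """Rescale an EMA decay so the averaging horizon, measured in samples,
    stays about the same when the batch size changes (assumed rule)."""
    return ref_decay ** (batch_size / ref_batch_size)


def ema_update(ema_params, params, decay):
    """One EMA step over flat lists of parameter values."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Placeholder reference values, for illustration only.
decay = scale_ema_decay(ref_decay=0.9999, ref_batch_size=192, batch_size=32)

ema = [0.0, 0.0]
for step_params in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_params, decay)
```

With a smaller batch size each step sees fewer samples, so the scaled decay ends up closer to 1 and the average changes more slowly per step.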

Credits:

Development and research led by @drhead.
With research and development assistance by @RedHotTensors.
And additional research assistance by @lodestones and Thessalo.
Dataset curation by @lodestones and Bannanapuncakes, with additional curation by @RedHotTensors.
And thanks to dogarrowtype for system administration assistance.

Based on:

CompVis Latent Diffusion: https://github.com/CompVis/latent-diffusion/
StabilityAI sd-vae-ft-mse: https://huggingface.co/stabilityai/sd-vae-ft-mse
LPIPS by Richard Zhang et al.: https://github.com/richzhang/PerceptualSimilarity
Oklab by Björn Ottosson: https://bottosson.github.io/posts/oklab/
fine-tune-models by Jonathan Chang: https://github.com/cccntu/fine-tune-models/

Built on:

Flax by Google Brain: https://github.com/google/flax
And Huggingface Diffusers: https://github.com/huggingface/diffusers

With deep thanks to the innumerable artists who released their works to the public for fair use in this non-commercial research project.
