Using the pretrained VAE to encode a 512x512 image to latent space gives NaN (the image has been normalized to [-1, 1])

#10
by LetsThink - opened

I am trying to fine-tune the upscaler model on my own data. However, when I encode a 512x512 image into the 128x128 latent space with the pretrained VAE weights, I get NaN values in the output tensor of size [b, 4, 128, 128].

I have traced the VAE forward function and found that, as the data flows through the computation graph, the activations quickly grow huge and overflow.
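Here is a minimal sketch of the encoding step that reproduces the NaN on my side (assuming the VAE is loaded from the stabilityai/stable-diffusion-x4-upscaler checkpoint with diffusers, and with a random tensor standing in for a real image):

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE from the upscaler checkpoint, in fp16 as under
# --mixed_precision="fp16" (checkpoint id is my assumption).
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

# Dummy 512x512 RGB batch normalized to [-1, 1].
image = torch.rand(1, 3, 512, 512, device="cuda", dtype=torch.float16) * 2.0 - 1.0

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

print(latents.shape)               # torch.Size([1, 4, 128, 128])
print(torch.isnan(latents).any())  # tensor(True) -- the activations overflow in fp16
```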


I used the Stable Diffusion fine-tuning script linked below and modified it to use my own dataset, since there is no fine-tuning script for the x4-upscaler model:
https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py

Is there any solution for this error?

Use fp32 instead of fp16 and give it a try.
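In code, that just means loading the VAE in full precision; the encode in the sketch above then yields finite latents (a minimal variant, same assumed checkpoint):

```python
# Same as the sketch above, but in fp32: no overflow, no NaN.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    subfolder="vae",
    torch_dtype=torch.float32,
).to("cuda")

with torch.no_grad():
    latents = vae.encode(image.float()).latent_dist.sample()

print(torch.isnan(latents).any())  # tensor(False)
```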

Thanks for your help, this works. It seems that when training the text_to_image model, the VAE works fine with --mixed_precision="fp16", but for the x4-upscaler model, simply casting the VAE to torch.float16 makes it overflow.
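In case it helps others, this is roughly the change I made in the training script (a sketch only; the variable names follow train_text_to_image.py, and the scaling_factor line assumes a recent diffusers version):

```python
# Keep the VAE in fp32 even when training with --mixed_precision="fp16".
vae.to(accelerator.device, dtype=torch.float32)  # was: dtype=weight_dtype

# In the training loop: encode in fp32, then cast the latents down to the
# mixed-precision dtype before they are fed to the UNet.
latents = vae.encode(
    batch["pixel_values"].to(dtype=torch.float32)
).latent_dist.sample()
latents = latents * vae.config.scaling_factor
latents = latents.to(weight_dtype)
```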

LetsThink changed discussion status to closed

It seems that the x4-upscaler VAE generates intermediate activation tensors with extreme values on the order of 1e7-1e8, far beyond what fp16 can represent.
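A quick way to see why such values cannot survive half precision:

```python
import torch

print(torch.finfo(torch.float16).max)  # 65504.0 -- activations of 1e7-1e8 overflow to inf
print(torch.finfo(torch.float32).max)  # ~3.4e38 -- plenty of headroom in fp32
```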
