Using the pretrained VAE to encode a 512x512 image to latent space gives NaN (the image has been normalized to [-1, 1])

#10
by LetsThink - opened

I am trying to fine-tune the upscaler model on my own data. However, when I encode a 512x512 image into the 128x128 latent space with the pretrained VAE weights, I get NaN values in the output tensor of size [b, 4, 128, 128].

I have traced the VAE forward function and found that, as the data flows through the computation graph, the activations quickly grow huge and overflow.
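Here is a minimal sketch of the encoding step that reproduces the NaN on my side (assuming the VAE is loaded from the stabilityai/stable-diffusion-x4-upscaler checkpoint with diffusers, and with a random tensor standing in for a real image):

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE from the upscaler checkpoint, in fp16 as under
# --mixed_precision="fp16" (checkpoint id is my assumption).
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

# Dummy 512x512 RGB batch normalized to [-1, 1].
image = torch.rand(1, 3, 512, 512, device="cuda", dtype=torch.float16) * 2.0 - 1.0

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

print(latents.shape)               # torch.Size([1, 4, 128, 128])
print(torch.isnan(latents).any())  # tensor(True) -- the activations overflow in fp16
```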


I used the Stable Diffusion fine-tuning script linked below and modified it to use my own dataset, since there is no fine-tuning script for the x4-upscaler model:
https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py

Is there any solution for this error?

Use fp32 instead of fp16 and give it a try.
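In code, that just means loading the VAE in full precision; the encode in the sketch above then yields finite latents (a minimal variant, same assumed checkpoint):

```python
# Same as the sketch above, but in fp32: no overflow, no NaN.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    subfolder="vae",
    torch_dtype=torch.float32,
).to("cuda")

with torch.no_grad():
    latents = vae.encode(image.float()).latent_dist.sample()

print(torch.isnan(latents).any())  # tensor(False)
```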

Thanks for your help, this works. It seems that when training the text_to_image model, the VAE works fine with --mixed_precision="fp16", but for the x4-upscaler model, simply casting the VAE to torch.float16 makes it overflow.
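In case it helps others, this is roughly the change I made in the training script (a sketch only; the variable names follow train_text_to_image.py, and the scaling_factor line assumes a recent diffusers version):

```python
# Keep the VAE in fp32 even when training with --mixed_precision="fp16".
vae.to(accelerator.device, dtype=torch.float32)  # was: dtype=weight_dtype

# In the training loop: encode in fp32, then cast the latents down to the
# mixed-precision dtype before they are fed to the UNet.
latents = vae.encode(
    batch["pixel_values"].to(dtype=torch.float32)
).latent_dist.sample()
latents = latents * vae.config.scaling_factor
latents = latents.to(weight_dtype)
```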

LetsThink changed discussion status to closed

It seems that the x4-upscaler VAE generates intermediate activation tensors with extreme values on the order of 1e7-1e8, far beyond what fp16 can represent.
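A quick way to see why such values cannot survive half precision:

```python
import torch

print(torch.finfo(torch.float16).max)  # 65504.0 -- activations of 1e7-1e8 overflow to inf
print(torch.finfo(torch.float32).max)  # ~3.4e38 -- plenty of headroom in fp32
```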
