
Loading the CLIPModel (or CLIPVisionModel) that matches this checkpoint

#8
by danaarad - opened

Hi, I'm having trouble finding a CLIPModel checkpoint (or just the CLIPVisionModel) that matches the CLIPTextModel used in this version. The open-clip library provides a different interface (it does not use the CLIPVisionModel class), and other CLIPModel checkpoints do not match the projection dim of this version (1024, while other checkpoints here are 768 or 512). Does anyone have a solution, or can you refer me to the correct checkpoint?
Thanks!
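For anyone hitting the same mismatch: in the transformers API, `hidden_size` (the width of `last_hidden_state`, which is what the Stable Diffusion UNet cross-attends to) and `projection_dim` (the width of the contrastive text/image embedding space) are separate config fields, so it's worth checking which of the two is actually 1024 for a given checkpoint. A minimal sketch with a tiny, made-up config (all numbers here are illustrative, not taken from any real checkpoint):

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny illustrative config: hidden_size and projection_dim are independent fields.
cfg = CLIPTextConfig(
    vocab_size=100,
    hidden_size=32,           # width of last_hidden_state (what SD conditions on)
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=16,
    projection_dim=24,        # width of the contrastive embedding space
)
model = CLIPTextModel(cfg)

input_ids = torch.randint(0, 100, (1, 8))
out = model(input_ids=input_ids)
print(out.last_hidden_state.shape)   # torch.Size([1, 8, 32]) -- hidden_size, not projection_dim
```

The same check on a downloaded checkpoint (`CLIPTextConfig.from_pretrained(...)`) shows which field differs between models before comparing vision-side shapes.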

Hello, did you get the CLIP vision model in the end? I tested the laion/CLIP-ViT-H-14-laion2B-s32B-b79K model, but the results and parameters are not the same. I'm confused: which CLIP model is used for SD2/2.1?

Hi, do you figure it out?

Hi, I didn't figure it out and ended up using a previous SD version. If anyone has any input, please share!

Also interested in this!

I have tested laion/CLIP-ViT-H-14-laion2B-s32B-b79K too, but it did not seem to work well. I found that the output of the CLIPTextModel in SD2.1 has dimension 1024, while the output of the CLIPVisionModel in laion/CLIP-ViT-H-14-laion2B-s32B-b79K is 1280, so they are not compatible. I then used CLIPVisionModelWithProjection with laion/CLIP-ViT-H-14-laion2B-s32B-b79K; after that, the dimensions match. But I'm not sure whether this is correct.
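To flesh out the approach above: `CLIPVisionModelWithProjection` adds the visual projection layer on top of the vision transformer, so its `image_embeds` output has width `projection_dim` rather than the width of the raw hidden states. A small sketch with a tiny, made-up config so it runs without downloading the real checkpoint (the sizes below just mirror the 1280 → 1024 relationship, scaled down; they are assumptions for illustration):

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModelWithProjection

# Tiny illustrative config; hidden_size stands in for ViT-H's 1280 and
# projection_dim for the 1024-dim space the text side also projects into.
cfg = CLIPVisionConfig(
    hidden_size=40,
    intermediate_size=80,
    num_hidden_layers=2,
    num_attention_heads=4,
    image_size=32,
    patch_size=8,
    projection_dim=24,
)
model = CLIPVisionModelWithProjection(cfg)

pixel_values = torch.randn(1, 3, 32, 32)
out = model(pixel_values=pixel_values)
print(out.last_hidden_state.shape)  # torch.Size([1, 17, 40]): 16 patches + CLS, hidden_size wide
print(out.image_embeds.shape)       # torch.Size([1, 24]): projected to projection_dim
```

For the real checkpoint, `CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")` should likewise yield 1024-dim `image_embeds`, which is what makes the shapes line up with the 1024-dim text side as described above.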
