zer0int/CLIP-GmP-ViT-L-14

Difference between 300 MB and 900 MB versions?

#10

by Geralt28 - opened 25 days ago

What are the differences between the versions (there are two file sizes each for TEXT and smooth)? For example:
ViT-L-14-TEXT-detail-improved-hiT-GmP-HF.safetensors
ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors

One has "TE-only-HF" in the name, the other just "GmP-HF". The smooth version has the same two variants.

P.S. Yes, I saw "You'll generally want the "TE-only" .safetensors" in the README, but I still wonder what the differences are :)

zer0int

Owner 25 days ago

"TE only" stands for "Text Encoder only". The "full CLIP" (larger file) has a text encoder and an image encoder; you'll need that for e.g. zero-shot image classification or anything else where CLIP needs to know (encode) the image AND the text.

For a text-to-image AI system, CLIP is just the "translator" from natural language to "AI space": it encodes the text prompt and passes the result to the generative AI. In this scenario, CLIP does not need its vision transformer, hence the smaller "Text Encoder only" file.
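
If you want to check this yourself, here is a minimal sketch that reads only the header of a .safetensors file and adds up the tensor bytes per top-level name prefix. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The prefixes `text_model` and `vision_model` are an assumption based on the usual Hugging Face CLIP naming and have not been checked against these particular files; `tower_sizes` and `has_image_encoder` are hypothetical helper names.

```python
import json
import struct
from collections import defaultdict


def tower_sizes(path):
    """Sum tensor bytes per top-level name prefix in a .safetensors file.

    Only the JSON header is read, so no weights are loaded into memory.
    """
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    sizes = defaultdict(int)
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        begin, end = info["data_offsets"]
        # Group by the part of the name before the first dot.
        sizes[name.split(".", 1)[0]] += end - begin
    return dict(sizes)


def has_image_encoder(sizes):
    # Assumption: the image encoder's tensors are named "vision_model.*".
    return "vision_model" in sizes
```

Running `tower_sizes` on ViT-L-14-TEXT-detail-improved-hiT-GmP-HF.safetensors and on ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors and comparing the two results should show which groups of tensors are missing from the smaller file, provided the naming assumption holds.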

Hope that helps! :)
