Difference between 300 MB and 900 MB versions?
What are the differences between the versions? There are two file sizes for both TEXT and smooth, e.g.:
ViT-L-14-TEXT-detail-improved-hiT-GmP-HF.safetensors
ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors
One has "TE-only-HF" in its name, the other just "GmP-HF". The smooth version comes as the same pair.
P.S. Yes, I saw "You'll generally want the "TE-only" .safetensors" in the README, but I still wonder what the differences are :)
"TE only" stands for "Text Encoder only". The "full CLIP" (larger file) has a text encoder and an image encoder; you'll need that for e.g. zero-shot image classification or anything else where CLIP needs to know (encode) the image AND the text.
For a text-to-image AI system, CLIP is just the "translator" from natural language to "AI space": it encodes the text prompt and passes the result to the generative AI. In this scenario, CLIP does not need its vision transformer, so the smaller file is "Text Encoder only".
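If you want to see the split for yourself, here is a minimal sketch that groups checkpoint tensor names by component. It assumes the files follow the usual Hugging Face CLIP naming (`text_model.`, `vision_model.`, `text_projection.`, `visual_projection.`, `logit_scale`); the key list below is a hand-written illustration, not a dump of the actual files, and `component_of` / `summarize` are hypothetical helper names.

```python
from collections import Counter

# Illustrative tensor names only (assumption: standard Hugging Face CLIP naming).
# A real checkpoint contains hundreds of tensors.
EXAMPLE_KEYS = [
    "text_model.embeddings.token_embedding.weight",
    "text_model.encoder.layers.0.self_attn.q_proj.weight",
    "text_projection.weight",
    "vision_model.embeddings.patch_embedding.weight",
    "vision_model.encoder.layers.0.self_attn.q_proj.weight",
    "visual_projection.weight",
    "logit_scale",
]


def component_of(key):
    """Classify a tensor name as belonging to the text side, the vision side, or neither."""
    if key.startswith(("text_model.", "text_projection.")):
        return "text"
    if key.startswith(("vision_model.", "visual_projection.")):
        return "vision"
    return "shared"


def summarize(keys):
    """Count how many tensors fall into each component."""
    return dict(Counter(component_of(k) for k in keys))


print(summarize(EXAMPLE_KEYS))
```

Run against the real key names of both files, the full file should show "vision" entries while the TE-only file should show few or none; exactly which keys the TE-only export keeps is not something this sketch can confirm.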
Hope that helps! :)