---
tags:
- clip
- siglip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
---

# Model card for ViT-B-16-SigLIP-i18n-256

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model has been converted from OpenCLIP ([timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256)) to a Hugging Face `CLIPVisionModel`:
```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# load the image processor and the converted vision tower from this repo
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')

inputs = image_processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)

image_embeds = outputs.pooler_output  # pooled image embedding from the vision tower
```

There is still a slight difference: Hugging Face's `CLIPVisionModel` uses a `[CLS]` embedding as the pooled embedding, while SigLIP uses a global attention pooler to get the final latent feature.
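
If you need image features that exactly match the original attention-pooled SigLIP output, one option is to load the source checkpoint through OpenCLIP instead of this converted vision tower. A minimal sketch, assuming `open_clip_torch` is installed (this pulls the upstream [timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256) weights, not this repo):

```python
# Sketch: original SigLIP image features via OpenCLIP (assumes `pip install open_clip_torch`)
import torch
import requests
from PIL import Image
from open_clip import create_model_from_pretrained

# load the upstream SigLIP checkpoint together with its preprocessing transform
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # encode_image runs the ViT trunk followed by SigLIP's attention pooling head
    image_features = model.encode_image(preprocess(image).unsqueeze(0))
```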