ikala-ray committed on
Commit a574114 · 1 Parent(s): 09e6fd1

Create README.md

Files changed (1)
  1. README.md +33 -0
README.md ADDED
---
tags:
- clip
- siglip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
---
# Model card for ViT-B-16-SigLIP-i18n-256

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model has been converted from the OpenCLIP checkpoint [timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256) to a Hugging Face `CLIPVisionModel`.

```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image and run it through the converted vision tower.
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
inputs = image_processor(images=image, return_tensors="pt")

vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
outputs = vision_tower(**inputs)

pooled_output = outputs.pooler_output  # pooled image embedding of shape (1, hidden_size)
```
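
For reference, the original OpenCLIP checkpoint linked above can also be loaded directly with the `open_clip_torch` package. A minimal sketch (only the checkpoint name comes from this card; the rest is illustrative):

```python
import open_clip
import torch
import requests
from PIL import Image

# Load the original SigLIP model and its eval-time preprocessing from the Hub.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # Attention-pooled image embedding from the original model.
    image_features = model.encode_image(preprocess(image).unsqueeze(0))
```

Because of the pooling difference noted below, this embedding is not expected to match the converted model's `pooler_output` exactly.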

There is still a slight difference: Hugging Face's `CLIPVisionModel` uses the [CLS] token embedding as its pooled output, while SigLIP uses a global attention pooler to produce the final latent feature.

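If the [CLS]-based pooling is not the feature you want, the per-token outputs are still available from `last_hidden_state`, so you can apply your own pooling instead. A minimal sketch continuing the usage example above (the mean pooling here is just an illustration, not SigLIP's attention pooler):

```python
# outputs.last_hidden_state has shape (batch, 1 + num_patches, hidden_size);
# index 0 is the class-token position that the pooler output is derived from.
patch_tokens = outputs.last_hidden_state[:, 1:]  # per-patch features
mean_pooled = patch_tokens.mean(dim=1)           # simple mean pooling as an alternative to the [CLS] embedding
```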