---
tags:
- clip
- siglip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
---

# Model card for ViT-B-16-SigLIP-i18n-256

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model has been converted from OpenCLIP ([timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256)) to a Hugging Face `CLIPVisionModel`:
```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# load the image processor and the converted vision tower from this repo
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')

inputs = image_processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)

image_embeds = outputs.pooler_output  # pooled image embedding from the vision tower
```

There is still a slight difference: Hugging Face's `CLIPVisionModel` uses a `[CLS]` embedding as the pooled embedding, while SigLIP uses a global attention pooler to get the final latent feature.
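
If you need image features that exactly match the original attention-pooled SigLIP output, one option is to load the source checkpoint through OpenCLIP instead of this converted vision tower. A minimal sketch, assuming `open_clip_torch` is installed (this pulls the upstream [timm/ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256) weights, not this repo):

```python
# Sketch: original SigLIP image features via OpenCLIP (assumes `pip install open_clip_torch`)
import torch
import requests
from PIL import Image
from open_clip import create_model_from_pretrained

# load the upstream SigLIP checkpoint together with its preprocessing transform
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # encode_image runs the ViT trunk followed by SigLIP's attention pooling head
    image_features = model.encode_image(preprocess(image).unsqueeze(0))
```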