
Model card for ViT-SO400M-14-SigLIP-384

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model has been converted from the OpenCLIP / timm checkpoint timm/ViT-SO400M-14-SigLIP-384 to a Hugging Face CLIPVisionModel.

from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

# Load the converted vision tower and a matching image processor
vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-SO400M-14-SigLIP-384-hf')
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-SO400M-14-SigLIP-384-hf')

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)

pooled_output = outputs.pooler_output  # pooled image embedding

There is still a slight difference: the Hugging Face CLIPVisionModel uses the [CLS] token embedding as the pooled output, whereas the original SigLIP model uses a global attention pooling head to produce the final latent feature.
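If that difference matters for your use case, you can bypass the [CLS]-based pooler and pool the patch tokens from last_hidden_state yourself. The sketch below is a minimal illustration that reuses the vision_tower and inputs from the example above and applies simple mean pooling as a rough stand-in for SigLIP's attention-pooling head; it is an approximation, not the original pooler.

import torch

# Continuing from the example above (assumes `vision_tower` and `inputs` exist)
with torch.no_grad():
    outputs = vision_tower(**inputs)

# last_hidden_state: (batch, num_tokens, hidden_dim) token features
patch_tokens = outputs.last_hidden_state

# Mean pooling over tokens as a rough stand-in for SigLIP's attention pooler
# (NOT the original attention-pooling head)
mean_pooled = patch_tokens.mean(dim=1)

# [CLS]-based pooled output used by CLIPVisionModel, for comparison
cls_pooled = outputs.pooler_output
print(mean_pooled.shape, cls_pooled.shape)  # both (batch, hidden_dim)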
