language:
- ja
tags:
- clip
- japanese-stable-clip
pipeline_tag: feature-extraction
license: other
extra_gated_prompt: >-
  By downloading, using, or distributing any portion or element of this model,
  you agree to be bound by the agreement described in the LICENSE file.
extra_gated_fields:
  Name: text
  Email: text
  Country: text
  Organization or Affiliation: text
  I allow Stability AI to contact me about information related to its models and research: checkbox
Japanese Stable CLIP ViT-L/16
Model Details
Japanese Stable CLIP is a Japanese CLIP (Contrastive Language-Image Pre-Training) model that maps both Japanese text and images into the same embedding space. On its own, the model can perform tasks such as zero-shot image classification and text-to-image retrieval. Combined with other components, it can also serve as part of generative models for image-to-text and text-to-image generation.
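Because text and images share one embedding space, text-to-image retrieval amounts to ranking image embeddings by their cosine similarity to a query text embedding. The following is a minimal sketch in plain Python with made-up 3-dimensional vectors; the helper names `cosine_similarity` and `rank_images` and all values are illustrative assumptions, not part of the model's API. With the real model, the vectors would come from its text and image encoders.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(
    query: List[float], images: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (image name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in images.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for real model outputs (hypothetical values).
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "elephant.jpg": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # pretend this encodes the text "犬"

ranking = rank_images(query_embedding, image_embeddings)
print([name for name, _ in ranking])
```

Zero-shot classification is the same computation in the other direction: one image embedding is compared against the embeddings of several candidate label texts.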
Usage
from typing import Union, List
import ftfy, html, re, io
import requests
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor, BatchFeature
# taken from https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/tokenizer.py#L65C8-L65C8
def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text
def tokenize(
    tokenizer,
    texts: Union[str, List[str]],
    max_seq_len: int = 77,
):
    """
    Tokenize texts in the same way as the original CLIP code.
    https://github.com/openai/CLIP/blob/main/clip/clip.py#L195
    """
    if isinstance(texts, str):
        texts = [texts]
    texts = [whitespace_clean(basic_clean(text)) for text in texts]

    # leave one slot free for the bos token that is prepended below
    inputs = tokenizer(
        texts,
        max_length=max_seq_len - 1,
        padding="max_length",
        truncation=True,
        add_special_tokens=False,
    )
    # add bos token at first place
    input_ids = [[tokenizer.bos_token_id] + ids for ids in inputs["input_ids"]]
    attention_mask = [[1] + am for am in inputs["attention_mask"]]
    position_ids = [list(range(0, len(input_ids[0])))] * len(texts)

    return BatchFeature(
        {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "position_ids": torch.tensor(position_ids, dtype=torch.long),
        }
    )
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "stabilityai/japanese-stable-clip-vit-l-16"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoImageProcessor.from_pretrained(model_path)
# Run!
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(images=image, return_tensors="pt").to(device)
text = tokenize(
    tokenizer=tokenizer,
    texts=["犬", "猫", "象"],
).to(device)
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1.0, 0.0, 0.0]]
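The last step of the snippet above multiplies the image-text similarities by 100 and applies a softmax, which yields one probability per candidate label. The sketch below reproduces that step in plain Python to show why the output is close to one-hot. It assumes the features are L2-normalized so that the dot products are cosine similarities (the card does not state this explicitly), and the similarity values and the helper name `scaled_softmax` are hypothetical.

```python
import math
from typing import List


def scaled_softmax(similarities: List[float], scale: float = 100.0) -> List[float]:
    """Softmax over `scale * similarity`, stabilized by subtracting the max logit."""
    logits = [scale * s for s in similarities]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["犬", "猫", "象"]
# Hypothetical cosine similarities between one image and the three labels.
similarities = [0.30, 0.18, 0.12]

probs = scaled_softmax(similarities)
best = labels[probs.index(max(probs))]
print(best, [round(p, 4) for p in probs])
```

With these toy values, a gap of 0.12 in cosine similarity becomes a gap of 12 in the logits, so nearly all probability mass lands on the best label. With `scale=1.0` the same similarities would give an almost uniform distribution.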
Model Details
- Developed by: Stability AI
- Model type: Contrastive Image-Text, Zero-Shot Image Classification
- Language(s): Japanese
- License: STABILITY AI JAPANESE STABLE CLIP COMMUNITY LICENSE.
| Model | ImageNet top-1 accuracy (%)* |
|---|---|
| Japanese Stable CLIP ViT-L/16 | 62.06 |
| rinna/japanese-cloob-vit-b-16 | 54.64 |
| laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k | 53 |
| rinna/japanese-clip-vit-b-16 | 50.69 |
* Scores were computed based on https://github.com/rinnakk/japanese-clip.
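Top-1 accuracy counts an image as correct when the class with the highest score is the ground-truth class. The sketch below illustrates the metric with made-up scores; the function name `top1_accuracy` and the data are hypothetical, and it does not reproduce the evaluation code referenced above (for example, any prompt templates that code may use).

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Percentage of rows whose arg-max column equals the target index."""
    if not targets or len(scores) != len(targets):
        raise ValueError("scores and targets must be non-empty and equally long")
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == target:
            correct += 1
    return 100.0 * correct / len(targets)


# Four images scored against three classes (hypothetical similarity scores).
scores = [
    [0.31, 0.12, 0.05],
    [0.10, 0.27, 0.08],
    [0.22, 0.25, 0.09],  # misclassified: class 1 scores highest, target is 0
    [0.04, 0.11, 0.29],
]
targets = [0, 1, 0, 2]

print(f"top-1 accuracy: {top1_accuracy(scores, targets):.2f}%")
```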
Training
The model uses a ViT-L/16 Transformer as the image encoder and a 12-layer BERT as the text encoder, with the Japanese tokenizer from rinna/japanese-roberta-base. The image encoder was initialized from the AugReg vit-large-patch16-224 model, and training used the SigLIP (Sigmoid Loss for Language-Image Pre-training) objective.
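For intuition, the SigLIP objective treats every image-text pair in a batch as an independent binary classification: matching pairs are positives and all other pairs are negatives. The following is a minimal plain-Python sketch of that pairwise sigmoid loss, assuming L2-normalized embeddings. The function names and the `temperature` and `bias` defaults are chosen here for illustration; the card does not state the values used to train this model.

```python
import math
from typing import List


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(
    image_emb: List[List[float]],
    text_emb: List[List[float]],
    temperature: float = 10.0,  # illustrative value, not this model's setting
    bias: float = -10.0,        # illustrative value, not this model's setting
) -> float:
    """Pairwise sigmoid loss over all image-text pairs, averaged over batch size."""
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(image_emb[i], text_emb[j]))
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * (temperature * sim + bias))
    return -total / n


# Two toy unit vectors: row i of the images matches row i of the texts.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = siglip_loss(images, aligned_texts)
shuffled = siglip_loss(images, shuffled_texts)
print(f"aligned: {aligned:.4f}, shuffled: {shuffled:.4f}")
```

Unlike the softmax-based contrastive loss of the original CLIP, each pair contributes its own term, so the loss does not require normalizing over the whole batch.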
Training Dataset
The training dataset includes the following public datasets:
- CC12M with captions translated into Japanese
- MS-COCO with STAIR Captions
Use and Limitations
Intended Use
This model is intended to be used by the open-source community in vision-language applications.
Limitations and bias
The training dataset may have contained offensive or inappropriate content even though we applied data filters. We recommend that users exercise reasonable caution when using this model in production systems. Do not use the model for any application that may cause harm or distress to individuals or groups.
How to cite
@misc{JapaneseStableCLIP,
  url = {https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16},
  title = {Japanese Stable CLIP ViT-L/16},
  author = {Shing, Makoto and Akiba, Takuya}
}