---
license: mit
language:
  - ja
pipeline_tag: feature-extraction
tags:
  - clip
  - japanese-clip
---

# recruit-jp/japanese-clip-vit-b-32-roberta-base

## Overview

- Developed by: Recruit Co., Ltd.
- Model type: Contrastive Language-Image Pretrained Model
- Language(s): Japanese
- License: MIT

More details are described in our tech blog post.

## Model Details

This model is a Japanese CLIP (Contrastive Language-Image Pretraining) model. It maps Japanese texts and images into the same embedding space, so it can be used for tasks such as zero-shot image classification, text-image retrieval, and image feature extraction.

The image encoder is taken from laion/CLIP-ViT-B-32-laion2B-s34B-b79K, and rinna/japanese-roberta-base is used as the text encoder. The model is trained on the Japanese subset of the LAION2B-multi dataset and is tailored to Japanese.
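As a rough illustration of what "the same embedding space" means, similarity between an image and candidate texts reduces to a dot product of L2-normalized embeddings. This is a minimal sketch with placeholder tensors; the 512-dimensional size is an assumption for illustration and should be read from the model config in practice.

```python
import torch

# Placeholders standing in for model.get_image_features(...) and
# model.get_text_features(...) from the usage example below; the 512-dim size is assumed.
image_emb = torch.randn(1, 512)
text_embs = torch.randn(3, 512)

# L2-normalize so the dot product equals cosine similarity in the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)

similarity = image_emb @ text_embs.T  # shape (1, 3): one score per candidate text
```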

## How to use

1. Install the required packages:

```bash
pip install pillow requests transformers torch torchvision sentencepiece
```

2. Run the code below:
```python
import io
import requests
from PIL import Image

import torch
import torchvision
from transformers import AutoTokenizer, AutoModel

model_name = "recruit-jp/japanese-clip-vit-b-32-roberta-base"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)


def _convert_to_rgb(image):
    return image.convert('RGB')


# Image preprocessing for the ViT-B/32 image encoder: bicubic resize,
# 224x224 center crop, and CLIP normalization statistics.
preprocess = torchvision.transforms.Compose([
    torchvision.transforms.Resize(size=224, interpolation=torchvision.transforms.InterpolationMode.BICUBIC, max_size=None),
    torchvision.transforms.CenterCrop(size=(224, 224)),
    _convert_to_rgb,
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
])


def tokenize(tokenizer, texts):
    # Prepend [CLS] manually and pad/truncate to 77 tokens (the CLIP text length),
    # since special tokens are not added automatically here.
    texts = ["[CLS]" + text for text in texts]
    encodings = [
        tokenizer(text, max_length=77, padding="max_length", truncation=True, add_special_tokens=False)["input_ids"]
        for text in texts
    ]
    return torch.LongTensor(encodings)


# Run!
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(image).unsqueeze(0).to(device)
text = tokenize(tokenizer, texts=["犬", "猫", "象"]).to(device)  # "dog", "cat", "elephant"

with torch.inference_mode():
    image_features = model.get_image_features(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    text_features = model.get_text_features(input_ids=text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between the image and each candidate text.
    probs = image_features @ text_features.T

print("Label probs:", probs.cpu().numpy()[0])
```

## Model Performance

We evaluated the model on the datasets listed below. Since ImageNet V2 and Food101 originate from an English-speaking context, we translated their class labels into Japanese before evaluation.

We also evaluated laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k, laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k, rinna/japanese-clip-vit-b-16, and stabilityai/japanese-stable-clip-vit-l-16 on the same datasets. Note that because stabilityai/japanese-stable-clip-vit-l-16 was trained on the STAIR Captions dataset, we skipped its evaluation on STAIR Captions.

| Model | ImageNet V2 | Food101 | ETLC-hiragana | ETLC-katakana | STAIR Captions image-to-text | STAIR Captions text-to-image | jafood101 | jaflower30 | jafacility20 | jalandmark10 |
|---|---|---|---|---|---|---|---|---|---|---|
| laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k | 0.471 | 0.742 | 0.055 | 0.029 | 0.462 | 0.223 | 0.709 | 0.869 | 0.820 | 0.899 |
| laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k | 0.326 | 0.508 | 0.162 | 0.061 | 0.372 | 0.169 | 0.609 | 0.709 | 0.749 | 0.846 |
| rinna/japanese-clip-vit-b-16 | 0.435 | 0.491 | 0.014 | 0.024 | 0.089 | 0.034 | 0.308 | 0.592 | 0.406 | 0.656 |
| stabilityai/japanese-stable-clip-vit-l-16 | 0.481 | 0.460 | 0.013 | 0.023 | - | - | 0.413 | 0.689 | 0.677 | 0.752 |
| recruit-jp/japanese-clip-vit-b-32-roberta-base | 0.175 | 0.301 | 0.030 | 0.038 | 0.191 | 0.102 | 0.524 | 0.592 | 0.676 | 0.797 |
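For readers who want to compute a comparable zero-shot accuracy themselves, the following is a minimal sketch of the standard CLIP-style evaluation loop: encode the translated Japanese class labels once, then pick the most similar label for each image. It reuses `model`, `tokenizer`, `preprocess`, `tokenize`, and `device` from the usage example above; `eval_samples` and `ja_class_names` are hypothetical stand-ins for a dataset loader and a translated label list, and this is not necessarily the exact protocol behind the table.

```python
import torch

# Minimal zero-shot accuracy sketch (illustrative, not the exact evaluation protocol).
# `eval_samples` is a hypothetical list of (PIL.Image, class_index) pairs and
# `ja_class_names` a hypothetical list of class labels translated into Japanese.
def zero_shot_accuracy(eval_samples, ja_class_names):
    with torch.inference_mode():
        text_features = model.get_text_features(
            input_ids=tokenize(tokenizer, ja_class_names).to(device)
        )
        text_features /= text_features.norm(dim=-1, keepdim=True)

        correct = 0
        for pil_image, label in eval_samples:
            image_features = model.get_image_features(preprocess(pil_image).unsqueeze(0).to(device))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            pred = (image_features @ text_features.T).argmax(dim=-1).item()
            correct += int(pred == label)
    return correct / len(eval_samples)
```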

## Training Dataset

This model is trained on 128M image-text pairs from the Japanese subset of the LAION2B-multi dataset.