---
license: cc-by-4.0
language:
- ja
pipeline_tag: feature-extraction
tags:
- clip
- japanese-clip
---
# recruit-jp/japanese-clip-vit-b-32-roberta-base
## Overview
* **Developed by**: [Recruit Co., Ltd.](https://huggingface.co/recruit-jp)
* **Model type**: Contrastive Language-Image Pretrained Model
* **Language(s)**: Japanese
* **LICENSE**: CC-BY-4.0
More details are described in our tech blog post (in Japanese):
* [日本語CLIP学習済みモデルとその評価用データセットの公開](https://blog.recruit.co.jp/data/articles/japanese-clip/)
## Model Details
This model is a Japanese [CLIP](https://arxiv.org/abs/2103.00020). Using this model, you can map Japanese texts and images into the same embedding space.
You can use this model for tasks such as zero-shot image classification, text-image retrieval, image feature extraction, and so on.
This model uses the image encoder of [laion/CLIP-ViT-B-32-laion2B-s34B-b79K](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K) and [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base) as its text encoder.
It is trained on the Japanese subset of the [LAION2B-multi dataset](https://huggingface.co/datasets/laion/laion2B-multi) and is tailored to the Japanese language.
## How to use
1. Install packages
```shell
pip install pillow requests transformers torch torchvision sentencepiece
```
2. Run the code below
```python
import io
import requests
import torch
import torchvision
from PIL import Image
from transformers import AutoTokenizer, AutoModel

model_name = "recruit-jp/japanese-clip-vit-b-32-roberta-base"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and the model (trust_remote_code is required because the
# model class is defined in the model repository itself)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)


def _convert_to_rgb(image):
    return image.convert('RGB')


# Standard CLIP image preprocessing: resize, center-crop to 224x224,
# and normalize with CLIP's mean/std statistics
preprocess = torchvision.transforms.Compose([
    torchvision.transforms.Resize(size=224, interpolation=torchvision.transforms.InterpolationMode.BICUBIC, max_size=None),
    torchvision.transforms.CenterCrop(size=(224, 224)),
    _convert_to_rgb,
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]),
])


def tokenize(tokenizer, texts):
    # Prepend "[CLS]" manually and disable add_special_tokens, as this text encoder expects
    texts = ["[CLS]" + text for text in texts]
    encodings = [
        # NOTE: the maximum token length that can be fed into this model is 77
        tokenizer(text, max_length=77, padding="max_length", truncation=True, add_special_tokens=False)["input_ids"]
        for text in texts
    ]
    return torch.LongTensor(encodings)


# Run!
image = Image.open(
    io.BytesIO(
        requests.get(
            'https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260'
        ).content
    )
)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenize(tokenizer, texts=["犬", "猫", "象"]).to(device)

with torch.inference_mode():
    image_features = model.get_image_features(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features = model.get_text_features(input_ids=text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities between the image and each candidate label text
    probs = image_features @ text_features.T

print("Label probs:", probs.cpu().numpy()[0])
```
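Because images and texts are mapped into the same embedding space, the same components can also be used for text-to-image retrieval. The following is a minimal sketch that reuses `model`, `preprocess`, and `tokenize` from the example above and assumes `get_image_features` accepts a batch of preprocessed images; the image file names and the query string are illustrative only.
```python
from pathlib import Path

# Hypothetical local images to search over (file names are illustrative)
image_paths = [Path("dog.jpg"), Path("cat.jpg"), Path("elephant.jpg")]

# Encode the candidate images as one batch and the query text
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
query = tokenize(tokenizer, texts=["草原を走る犬"]).to(device)  # "a dog running in a meadow"

with torch.inference_mode():
    image_features = model.get_image_features(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features = model.get_text_features(input_ids=query)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity of the single query against every image, shape (num_images,)
    similarities = (text_features @ image_features.T)[0]

# Print the images from most to least similar to the query
for score, path in sorted(zip(similarities.tolist(), image_paths), reverse=True):
    print(f"{score:.3f}  {path}")
```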
## Model Performance
We evaluated the model on the datasets listed below.
Since ImageNet V2 and Food101 were created in an English-speaking context, we translated their class labels into Japanese before conducting the evaluation.
* [ImageNet V2](https://github.com/modestyachts/ImageNetV2_pytorch) test set (Top-1 Accuracy)
* [Food101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) (Top-1 Accuracy)
* [Hiragana dataset from ETL Character Database](http://etlcdb.db.aist.go.jp/?lang=ja) (Top-1 Accuracy)
* [Katakana dataset from ETL Character Database](http://etlcdb.db.aist.go.jp/?lang=ja) (Top-1 Accuracy)
* [STAIR Captions](http://captions.stair.center/) Image-to-Text Retrieval (Average of Precision@1,5,10; see the sketch after this list)
* [STAIR Captions](http://captions.stair.center/) Text-to-Image Retrieval (Average of Precision@1,5,10)
* [jafood101](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jafood101.csv) (Top-1 Accuracy)
* [jaflower30](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jaflower30.csv) (Top-1 Accuracy)
* [jafacility20](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jafacility20.csv) (Top-1 Accuracy)
* [jalandmark10](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset/blob/main/jalandmark10.csv) (Top-1 Accuracy)
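The STAIR Captions retrieval scores are reported as the average of Precision@1, 5, and 10. The exact evaluation protocol is not spelled out here, but the sketch below shows one common way such a score can be computed from a query-by-candidate similarity matrix; the function name `precision_at_k` and the toy data are illustrative only.
```python
import numpy as np

def precision_at_k(similarity, relevant, ks=(1, 5, 10)):
    """similarity: (num_queries, num_candidates) score matrix.
    relevant: for each query, the set of candidate indices counted as correct."""
    results = {}
    for k in ks:
        # Top-k candidate indices per query, highest score first
        topk = np.argsort(-similarity, axis=1)[:, :k]
        results[k] = float(np.mean([
            len(set(row.tolist()) & rel) / k
            for row, rel in zip(topk, relevant)
        ]))
    return results

# Toy example: 2 queries over 4 candidates; the reported metric is the mean over the chosen k values
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.8, 0.7, 0.1]])
rel = [{0}, {1, 2}]
scores = precision_at_k(sim, rel, ks=(1, 2))
print(scores, "average:", sum(scores.values()) / len(scores))
```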
We also evaluated [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k), [laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k), [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16) and [stabilityai/japanese-stable-clip-vit-l-16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) on the same datasets.
Note that since stabilityai/japanese-stable-clip-vit-l-16 is trained on the STAIR Captions dataset, we skipped evaluating it on STAIR Captions.
| **Model** | **ImageNet V2** | **Food101** | **ETLC-hiragana** | **ETLC-katakana** | **STAIR Captions image-to-text** | **STAIR Captions text-to-image** | **jafood101**| **jaflower30** | **jafacility20** | **jalandmark10** |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k|**0.471**|**0.742**|0.055|0.029|**0.462**|**0.223**|**0.709**|**0.869**|**0.820**|**0.899**|
|laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k|0.326|0.508|**0.162**|**0.061**|0.372|0.169|0.609|0.709|0.749|0.846|
|rinna/japanese-clip-vit-b-16|0.435|0.491|0.014|0.024|0.089|0.034|0.308|0.592|0.406|0.656|
|stabilityai/japanese-stable-clip-vit-l-16|0.481|0.460|0.013|0.023|-|-|0.413|0.689|0.677|0.752|
|recruit-jp/japanese-clip-vit-b-32-roberta-base|0.175|0.301|0.030|0.038|0.191|0.102|0.524|0.592|0.676|0.797|
## Training Dataset
This model is trained on 128M image-text pairs from the Japanese subset of the [LAION2B-multi](https://huggingface.co/datasets/laion/laion2B-multi) dataset.
## Disclaimer
Recruit Co., Ltd. makes no representations or warranties, and provides no compensation, regarding the accuracy, usefulness, reliability, or legality of any results obtained through the use of this model, and accepts no responsibility for any damages incurred by users or for any disputes between users and third parties arising from use of the model.