File size: 4,342 Bytes
a0fd4bb
071945c
a0fd4bb
071945c
 
 
 
a0fd4bb
071945c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ff0d639
071945c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
---
language: ja
license: apache-2.0
tags:
- clip
- japanese-clip
pipeline_tag: feature-extraction
---

# clip-japanese-base

This is a Japanese [CLIP (Contrastive Language-Image Pre-training)](https://arxiv.org/abs/2103.00020) model developed by [LY Corporation](https://www.lycorp.co.jp/en/). This model was trained on ~1B web-collected image-text pairs, and it is applicable to various visual tasks including zero-shot image classification, text-to-image or image-to-text retrieval.

## How to use
1. Install packages
```
pip install pillow requests sentencepiece transformers torch timm
```
2. Run
```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
device = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt")
text = tokenizer(["犬", "猫", "象"])

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
```

## Model architecture

The model uses an [Eva02-B](https://huggingface.co/timm/eva02_base_patch16_clip_224.merged2b_s8b_b131k) Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16).

## Evaluation
### Dataset
- [STAIR Captions](http://captions.stair.center/) (v2014 val set of MSCOCO) for image-to-text (i2t) and text-to-image (t2i) retrieval. We measure performance using R@1, which is the average recall of i2t and t2i retrieval.
- [Recruit Datasets](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset) for image classification.
- [ImageNet-1K](https://www.image-net.org/download.php) for image classification. We translated all classnames into Japanese. The classnames and templates can be found in [ja-imagenet-1k-classnames.txt](https://huggingface.co/line-corporation/clip-japanese-base/blob/main/ja-imagenet-1k-classnames.txt) and [ja-imagenet-1k-templates.txt](https://huggingface.co/line-corporation/clip-japanese-base/blob/main/ja-imagenet-1k-templates.txt).

### Result
| Model             | Image Encoder Params | Text Encoder params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|-------------------|----------------------|---------------------|----------------|------------------|-----------------|
| [Ours](https://huggingface.co/line-corporation/clip-japanese-base) | 86M(Eva02-B)         | 100M(BERT)          | **0.30**           | **0.89**             | 0.58            |
| [Stable-ja-clip](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16)    | 307M(ViT-L)          | 100M(BERT)          | 0.24           | 0.77             | **0.68**            |
| [Rinna-ja-clip](https://huggingface.co/rinna/japanese-clip-vit-b-16)     | 86M(ViT-B)           | 100M(BERT)         | 0.13           | 0.54             | 0.56            |
| [Laion-clip](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)        | 632M(ViT-H)          | 561M(XLM-RoBERTa)   | **0.30**           | 0.83             | 0.58            |
| [Hakuhodo-ja-clip](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-wider)  | 632M(ViT-H)          | 100M(BERT)          | 0.21           | 0.82             | 0.46            |

## Licenses

[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation
```
@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author={Shuhei Yokoo, Shuntaro Okada, Peifei Zhu, Shuhei Nishimura and Naoki Takayama}
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
```