license: apache-2.0
Chinese-CLIP-ViT-Large-Patch14-336px
Introduction
This is the large-version of the Chinese CLIP, with ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official github repo https://github.com/OFA-Sys/Chinese-CLIP
Use with the official API
We provide a simple code snippet to show how to use the API for Chinese-CLIP. For starters, please install cn_clip:
# to install the latest stable release
pip install cn_clip
# or install from source code
cd Chinese-CLIP
pip install -e .
After installation, use Chinese CLIP as shown below:
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-L-14-336", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# Normalize the features. Please use the normalized features for downstream tasks.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
logits_per_image, logits_per_text = model.get_similarity(image, text)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs) # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]
However, if you are not satisfied with only using the API, feel free to check our github repo https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.
Results
MUGE Text-to-Image Retrieval:
Setup | Zero-shot | Finetune | ||||||
---|---|---|---|---|---|---|---|---|
Metric | R@1 | R@5 | R@10 | MR | R@1 | R@5 | R@10 | MR |
Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
Flickr30K-CN Retrieval:
Task | Text-to-Image | Image-to-Text | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Setup | Zero-shot | Finetune | Zero-shot | Finetune | ||||||||
Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |
COCO-CN Retrieval:
Task | Text-to-Image | Image-to-Text | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Setup | Zero-shot | Finetune | Zero-shot | Finetune | ||||||||
Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |
Zero-shot Image Classification:
Task | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
---|---|---|---|---|---|---|---|---|---|---|
GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
Citation
If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!
@article{chinese-clip,
title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
journal={arXiv preprint arXiv:2211.01335},
year={2022}
}