Chinese-CLIP-RN50

Introduction

This is the smallest model of the Chinese CLIP series, with ResNet-50 as the image encoder and RBT3 as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report (https://arxiv.org/abs/2211.01335) and our official GitHub repo (https://github.com/OFA-Sys/Chinese-CLIP).

Use with the official API

We provide a simple code snippet showing how to use the API for Chinese-CLIP. To get started, install cn_clip:

# to install the latest stable release
pip install cn_clip

# or install from the source code
git clone https://github.com/OFA-Sys/Chinese-CLIP.git
cd Chinese-CLIP
pip install -e .

After installation, use Chinese CLIP as shown below:

import torch
from PIL import Image

import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())  
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("RN50", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)  # Squirtle, Bulbasaur, Charmander, Pikachu

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the features. Please use the normalized features for downstream tasks.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # get_similarity encodes and normalizes internally, then applies the learned logit scale
    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]

If the API alone does not meet your needs, please check our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for more details on training and inference.
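
For inference over many candidates, a common pattern is to pre-compute normalized text (or image) features once and reuse them as an index. Below is a minimal sketch along these lines, reusing the model and device from the snippet above; the candidate captions are illustrative placeholders:

import torch
import cn_clip.clip as clip

captions = ["一只狗在草地上奔跑", "一辆红色的汽车", "夜晚的城市天际线"]  # illustrative: "a dog running on grass", "a red car", "city skyline at night"
features = []
with torch.no_grad():
    for i in range(0, len(captions), 2):  # encode in small batches
        tokens = clip.tokenize(captions[i:i + 2]).to(device)
        feats = model.encode_text(tokens)
        feats /= feats.norm(dim=-1, keepdim=True)  # normalize for cosine-similarity search
        features.append(feats)
text_index = torch.cat(features)  # [num_captions, dim]; compare image queries against this with a dot product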

Results

MUGE Text-to-Image Retrieval:

| Model   | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Zero-shot MR | Finetune R@1 | Finetune R@5 | Finetune R@10 | Finetune MR |
|---------|---------------|---------------|----------------|--------------|--------------|--------------|---------------|-------------|
| Wukong  | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2    | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
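
In these tables, R@K is the percentage of queries whose ground-truth match appears among the top-K retrieved candidates, and MR is the mean of R@1, R@5, and R@10. The following is an illustrative sketch of computing these metrics from a query-by-gallery similarity matrix (toy data, not the official evaluation script):

import numpy as np

def recall_at_k(similarity, ground_truth, k):
    # similarity: [num_queries, num_gallery]; ground_truth[i] = correct gallery index for query i
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    return 100.0 * (top_k == ground_truth[:, None]).any(axis=1).mean()

similarity = np.random.rand(100, 500)  # toy similarity scores
ground_truth = np.arange(100)          # toy labels: query i matches gallery item i
recalls = [recall_at_k(similarity, ground_truth, k) for k in (1, 5, 10)]
print("R@1/R@5/R@10:", recalls, "MR:", float(np.mean(recalls)))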

Flickr30K-CN Retrieval:

Text-to-Image:

| Model   | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Finetune R@1 | Finetune R@5 | Finetune R@10 |
|---------|---------------|---------------|----------------|--------------|--------------|---------------|
| Wukong  | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 |
| R2D2    | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 |
| CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 |

Image-to-Text:

| Model   | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Finetune R@1 | Finetune R@5 | Finetune R@10 |
|---------|---------------|---------------|----------------|--------------|--------------|---------------|
| Wukong  | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| R2D2    | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| CN-CLIP | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |

COCO-CN Retrieval:

Text-to-Image:

| Model   | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Finetune R@1 | Finetune R@5 | Finetune R@10 |
|---------|---------------|---------------|----------------|--------------|--------------|---------------|
| Wukong  | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 |
| R2D2    | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 |
| CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 |

Image-to-Text:

| Model   | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Finetune R@1 | Finetune R@5 | Finetune R@10 |
|---------|---------------|---------------|----------------|--------------|--------------|---------------|
| Wukong  | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| R2D2    | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| CN-CLIP | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |

Zero-shot Image Classification:

| Model   | CIFAR10 | CIFAR100 | DTD  | EuroSAT | FER  | FGVC | KITTI | MNIST | PC   | VOC  |
|---------|---------|----------|------|---------|------|------|-------|-------|------|------|
| GIT     | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7  | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN   | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP    | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong  | 95.4 | 77.1 | 40.9 | 50.3 | -    | -    | -    | -    | -    | -    |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
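
Zero-shot classification reuses the retrieval setup: embed each class name with a Chinese prompt template, then pick the class whose text feature is closest to the image feature. The sketch below builds on the API shown earlier; the prompt template, label set, and fixed similarity scale are illustrative assumptions, not the exact protocol behind the numbers above:

import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("RN50", device=device, download_root='./')
model.eval()

labels = ["猫", "狗", "飞机"]  # illustrative classes: cat, dog, airplane
texts = clip.tokenize([f"一张{label}的照片" for label in labels]).to(device)  # "a photo of a {label}"
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # fixed scale of 100 as a stand-in for the model's learned logit scale
    probs = (100.0 * image_features @ text_features.t()).softmax(dim=-1)

print("Predicted label:", labels[probs.argmax().item()])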

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
