Model Overview

Model Summary

This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural-language descriptions paired with images found online. Trained on a large dataset of image-text pairs, it maps images and text into a shared embedding space, which supports tasks such as zero-shot image classification and image search from text queries.
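To make the shared embedding space concrete, here is a minimal sketch of zero-shot classification once image and text embeddings are available. It uses plain Python and toy 3-dimensional vectors in place of real encoder outputs. The function names and the `logit_scale` value of 100 are assumptions for illustration, not values read from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    image_emb = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text_emb = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image_emb, text_emb))
        logits.append(logit_scale * cosine)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for real encoder outputs (hypothetical values).
labels = ["mountains", "cat on tortoise", "house"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.9, 0.2]

probs = zero_shot_probs(image_emb, text_embs)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

The label whose text embedding points in the direction closest to the image embedding receives the highest probability; no classifier is trained for the label set.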

Weights are released under the MIT License. Keras model code is released under the Apache 2 License.

Installation

Keras and KerasCV can be installed with:

pip install -U -q keras-cv
pip install -U -q "keras>=3"

JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
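Keras 3 picks its backend from the `KERAS_BACKEND` environment variable, which has to be set before `keras` is first imported. The sketch below shows one way to do that. `select_backend` and `SUPPORTED_BACKENDS` are hypothetical names introduced for illustration and are not part of Keras.

```python
import os

# Backend names accepted by Keras 3.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def select_backend(name):
    """Set KERAS_BACKEND; call this before the first `import keras`."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    return name


select_backend("jax")
# import keras  # import only after the variable has been set
```

Changing the variable after `keras` has been imported has no effect on the running process, so in a notebook this belongs in the first cell.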

Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

Preset name	Parameters	Description
clip-vit-base-patch16	149.62M	The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224).
clip-vit-base-patch32	151.28M	The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224).
clip-vit-large-patch14	427.62M	The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224).
clip-vit-large-patch14-336	427.94M	The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336).
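The patch size and input size in the table determine how many patches the image encoder sees per image. The following sketch works through that arithmetic for each preset. `PRESET_GEOMETRY` and `num_patches` are hypothetical helpers that restate the table and are not part of KerasCV.

```python
# (input image side, patch side) per preset, copied from the table above.
PRESET_GEOMETRY = {
    "clip-vit-base-patch16": (224, 16),
    "clip-vit-base-patch32": (224, 32),
    "clip-vit-large-patch14": (224, 14),
    "clip-vit-large-patch14-336": (336, 14),
}


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


patch_counts = {
    name: num_patches(*geometry) for name, geometry in PRESET_GEOMETRY.items()
}
for name, count in patch_counts.items():
    print(f"{name}: {count} patches")
```

Smaller patches or larger inputs mean longer patch sequences, which is why the 336-pixel preset costs more to run than the 224-pixel preset with the same patch size. ViT-style encoders typically prepend one extra class token to this sequence.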

Example code

import keras
from keras import ops
from keras_cv.models import CLIP
from keras_cv.models.feature_extractor.clip import CLIPProcessor


# Optional helper: use it if your image is not already preprocessed.
def transform_image(image_path, input_resolution):
    """Load an image and return a normalized batch of shape (1, H, W, 3)."""
    # Per-channel RGB mean and standard deviation used for normalization.
    mean = ops.array([0.48145466, 0.4578275, 0.40821073])
    std = ops.array([0.26862954, 0.26130258, 0.27577711])

    image = keras.utils.load_img(image_path)
    image = keras.utils.img_to_array(image)
    # Resize to a square of the target resolution and scale pixels to [0, 1].
    image = (
        ops.image.resize(
            image,
            (input_resolution, input_resolution),
            interpolation="bicubic",
        )
        / 255.0
    )
    # Central crop. After the square resize above the fraction is 1.0,
    # so the crop keeps the whole image.
    central_fraction = input_resolution / image.shape[0]
    height, width = image.shape[0], image.shape[1]
    top = int((height - height * central_fraction) / 2)
    left = int((width - width * central_fraction) / 2)
    bottom = int((height + height * central_fraction) / 2)
    right = int((width + width * central_fraction) / 2)

    image = ops.slice(
        image, [top, left, 0], [bottom - top, right - left, 3]
    )

    image = (image - mean) / std
    # Add a leading batch dimension.
    return ops.expand_dims(image, axis=0)


# "vocab.json" and "merges.txt" are the tokenizer files; "cat.jpg" is an
# example image path. All three must exist in the working directory.
processor = CLIPProcessor("vocab.json", "merges.txt")
processed_image = transform_image("cat.jpg", 224)
tokens = processor(["mountains", "cat on tortoise", "house"])
model = CLIP.from_preset("clip-vit-base-patch32")
output = model(
    {
        "images": processed_image,
        "token_ids": tokens["token_ids"],
        "padding_mask": tokens["padding_mask"],
    }
)
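Because `transform_image` resizes straight to a square, its central crop has nothing left to remove and non-square images are stretched. A common alternative, offered here as an assumption rather than something this card specifies, is to resize the shorter side to the target resolution and then crop the center. The hypothetical helper `resize_then_crop_geometry` sketches only the arithmetic, with no image library involved.

```python
def resize_then_crop_geometry(height, width, target):
    """Return the resized (height, width) and the (top, left) offset of a
    centered target x target crop, for a shorter-side resize."""
    scale = target / min(height, width)
    # Guard against rounding below the target on the shorter side.
    new_height = max(target, round(height * scale))
    new_width = max(target, round(width * scale))
    top = (new_height - target) // 2
    left = (new_width - target) // 2
    return (new_height, new_width), (top, left)


# A 480x640 (height x width) image prepared for a 224-pixel preset.
resized, offset = resize_then_crop_geometry(480, 640, 224)
print(resized, offset)
```

The resized shape would be passed to the resize call and the offset to the slice call, with a slice size of `(target, target, 3)`.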
