---
library_name: keras-hub
---

Model Overview

Model Summary

This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural language supervision: it is trained on a large dataset of image-text pairs to align image and text representations in a shared embedding space. This makes it well suited to zero-shot image classification, text-based image retrieval, and other tasks that require robust visual understanding.
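
To make the shared embedding space concrete, the following sketch shows how zero-shot classification reduces to a cosine-similarity lookup between one image embedding and several candidate text embeddings. The random arrays are placeholders; with the real model these vectors would come from the image and text encoders used in the example code further down.

import numpy as np
from keras import ops

# Placeholder embeddings standing in for encoder outputs:
# one image and three candidate captions in a 512-dim shared space.
image_embedding = ops.convert_to_tensor(np.random.randn(1, 512).astype("float32"))
text_embeddings = ops.convert_to_tensor(np.random.randn(3, 512).astype("float32"))

# L2-normalize so the dot product becomes a cosine similarity.
image_embedding /= ops.sqrt(ops.sum(image_embedding**2, axis=-1, keepdims=True))
text_embeddings /= ops.sqrt(ops.sum(text_embeddings**2, axis=-1, keepdims=True))

# The caption with the highest similarity is the zero-shot prediction.
similarity = ops.matmul(image_embedding, ops.transpose(text_embeddings))
best_caption = ops.argmax(similarity, axis=-1)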

Weights are released under the MIT License. Keras model code is released under the Apache 2 License.

Links

Installation

Keras and KerasCV can be installed with:

pip install -U -q keras-cv
pip install -U -q "keras>=3"

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
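
Keras 3 selects its compute backend from the KERAS_BACKEND environment variable. A minimal way to pick one (JAX here, purely as an example) is to set it before Keras is first imported:

import os

# Must be set before keras is imported; valid values are
# "jax", "tensorflow", and "torch".
os.environ["KERAS_BACKEND"] = "jax"

import keras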

Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name | Parameters | Description |
|---|---|---|
| clip-vit-base-patch16 | 149.62M | The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224). |
| clip-vit-base-patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
| clip-vit-large-patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
| clip-vit-large-patch14-336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
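
Any preset above can be loaded by name with from_preset. For example, to load the 336-resolution variant (images then need to be preprocessed at (336, 336) rather than (224, 224)):

from keras_cv.models import CLIP

# Loads pretrained ViT-L/14 weights trained on 336x336 inputs.
model = CLIP.from_preset("clip-vit-large-patch14-336")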

Example code

import keras
from keras import ops
from keras_cv.models.feature_extractor.clip import CLIPProcessor
from keras_cv.models import CLIP


def transform_image(image_path, input_resolution):
    """Resize, center-crop, and normalize an image for CLIP."""
    mean = ops.array([0.48145466, 0.4578275, 0.40821073])
    std = ops.array([0.26862954, 0.26130258, 0.27577711])

    image = keras.utils.load_img(image_path)
    image = keras.utils.img_to_array(image)
    # Resize to the target resolution and rescale pixel values to [0, 1].
    image = (
        ops.image.resize(
            image,
            (input_resolution, input_resolution),
            interpolation="bicubic",
        )
        / 255.0
    )
    # Take a central crop covering `central_fraction` of the resized image.
    central_fraction = input_resolution / image.shape[0]
    width, height = image.shape[0], image.shape[1]
    left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
    top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
    right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
    bottom = ops.cast((height + height * central_fraction) / 2, dtype="int32")
    image = ops.slice(image, [left, top, 0], [right - left, bottom - top, 3])

    # Normalize with the CLIP training mean/std and add a batch axis.
    image = (image - mean) / std
    return ops.expand_dims(image, axis=0)


# Tokenize the candidate captions and preprocess the image.
processor = CLIPProcessor("vocab.json", "merges.txt")
tokens = processor(["mountains", "cat on tortoise", "house"])
processed_image = transform_image("cat.jpg", 224)

# Load pretrained weights and run one forward pass.
model = CLIP.from_preset("clip-vit-base-patch32")
output = model(
    {
        "images": processed_image,
        "token_ids": tokens["token_ids"],
        "padding_mask": tokens["padding_mask"],
    }
)
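
To read the result as a zero-shot classification, the image-to-text scores can be turned into probabilities with a softmax. The sketch below assumes that output exposes the image-to-text logits as its first element (one score per caption); the exact output structure may differ between keras-cv versions, so adjust the indexing accordingly.

from keras import ops

# Assumption: the image-to-text logits are the first element of `output`;
# check your keras-cv version if the output is structured differently.
image_logits = output[0]
probs = ops.softmax(image_logits, axis=-1)
print(probs)  # Probability that the image matches each of the three captions.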