---
library_name: keras-hub
---
### Model Overview

This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts directly from natural language descriptions by training on a large dataset of image-text pairs collected from the web. Because it aligns image and text representations in a shared embedding space, it performs well at zero-shot image classification, text-based image retrieval, and other visual understanding tasks.
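
The shared embedding space is what makes zero-shot use possible: an image embedding is scored against the embeddings of candidate text prompts, and the highest-scoring prompt wins. The sketch below illustrates only that scoring step, using placeholder embeddings and `keras.ops`; it does not call the CLIP model itself.

```
import keras
from keras import ops

# Placeholder embeddings: 1 image and 3 candidate text prompts,
# assumed to live in the same 512-dimensional embedding space.
image_embedding = keras.random.normal((1, 512))
text_embeddings = keras.random.normal((3, 512))

def l2_normalize(x):
    return x / ops.sqrt(ops.sum(x * x, axis=-1, keepdims=True))

# Cosine similarity is the dot product of L2-normalized embeddings.
similarity = ops.matmul(
    l2_normalize(image_embedding),
    ops.transpose(l2_normalize(text_embeddings)),
)

# A softmax over the text candidates yields zero-shot "probabilities".
probs = ops.softmax(similarity, axis=-1)  # shape (1, 3)
```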


Weights are released under the [MIT License](https://opensource.org/license/mit). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).

## Links

* [CLIP Quickstart Notebook](https://www.kaggle.com/code/divyasss/clip-quickstart-single-shot-classification)
* [CLIP API Documentation](https://keras.io/api/keras_cv/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)

## Installation

Keras and KerasCV can be installed with:

```
pip install -U -q keras-cv
pip install -U -q "keras>=3"
```

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
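
Keras selects its backend from the `KERAS_BACKEND` environment variable, which must be set before `keras` is imported. A minimal sketch for switching backends:

```
import os

# Must be set before the first `import keras`; valid values are
# "jax", "tensorflow", and "torch".
os.environ["KERAS_BACKEND"] = "jax"

import keras

print(keras.backend.backend())  # prints the active backend
```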

## Presets

The following model checkpoints are provided by the Keras team. A full code example is given below.

| Preset name                | Parameters | Description |
|----------------------------|------------|-------------|
| clip-vit-base-patch16      | 149.62M    | ViT-B/16 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 16, input image size (224, 224). |
| clip-vit-base-patch32      | 151.28M    | ViT-B/32 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 32, input image size (224, 224). |
| clip-vit-large-patch14     | 427.62M    | ViT-L/14 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input image size (224, 224). |
| clip-vit-large-patch14-336 | 427.94M    | ViT-L/14 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input image size (336, 336). |
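
Any preset name from the table can be passed to `CLIP.from_preset`, and the preprocessing resolution should match the preset's expected input size. For example (a sketch; only the preset name and input size come from the table above):

```
from keras_cv.models import CLIP

# The "-336" checkpoint expects 336x336 inputs, so preprocess images
# to that resolution rather than the default 224x224.
model = CLIP.from_preset("clip-vit-large-patch14-336")
```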

## Example code
```
from keras import ops
import keras
from keras_cv.models.feature_extractor.clip import CLIPProcessor
from keras_cv.models import CLIP


# Optional: preprocess an image from disk (resize, center-crop, normalize).
def transform_image(image_path, input_resolution):
    mean = ops.array([0.48145466, 0.4578275, 0.40821073])
    std = ops.array([0.26862954, 0.26130258, 0.27577711])

    image = keras.utils.load_img(image_path)
    image = keras.utils.img_to_array(image)
    image = (
        ops.image.resize(
            image,
            (input_resolution, input_resolution),
            interpolation="bicubic",
        )
        / 255.0
    )
    central_fraction = input_resolution / image.shape[0]
    width, height = image.shape[0], image.shape[1]
    left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
    top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
    right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
    bottom = ops.cast((height + height * central_fraction) / 2, dtype="int32")

    image = ops.slice(image, [left, top, 0], [right - left, bottom - top, 3])

    # Normalize with the CLIP training mean/std and add a batch dimension.
    image = (image - mean) / std
    return ops.expand_dims(image, axis=0)


# Tokenize the candidate text prompts.
processor = CLIPProcessor("vocab.json", "merges.txt")
tokens = processor(["mountains", "cat on tortoise", "house"])

# Preprocess the image and run the model.
processed_image = transform_image("cat.jpg", 224)
model = CLIP.from_preset("clip-vit-base-patch32")
output = model(
    {
        "images": processed_image,
        "token_ids": tokens["token_ids"],
        "padding_mask": tokens["padding_mask"],
    }
)
```
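
To turn the model output into zero-shot classification scores, apply a softmax over the text candidates. This sketch assumes the image-to-text logits can be pulled from `output` as a `(num_images, num_texts)` tensor; check the exact output structure of your installed keras-cv version.

```
# Assumption: `image_logits` has shape (num_images, num_texts). The exact
# structure of `output` depends on the keras-cv version in use.
image_logits = output[0] if isinstance(output, (tuple, list)) else output
probs = ops.softmax(image_logits, axis=-1)
best_match = ops.argmax(probs, axis=-1)  # index of the best prompt per image
```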