keras/clip_vit_large_patch14_336

Divyasreepat committed on Nov 15, 2024

Commit c91e634 (verified) · 1 Parent(s): 4ec3ea9

Update README.md with new model card content

README.md +84 -9

@@ -1,12 +1,87 @@
 ---
 library_name: keras-hub
 ---
-This is a [`CLIP` model](https://keras.io/api/keras_hub/models/clip) uploaded using the KerasHub library and can be used with JAX, TensorFlow, and PyTorch backends.
-Model config:
-* **name:** clip_backbone
-* **trainable:** True
-* **vision_encoder:** {'module': 'keras_hub.src.models.clip.clip_vision_encoder', 'class_name': 'CLIPVisionEncoder', 'config': {'name': 'clip_vision_encoder', 'trainable': True, 'patch_size': 14, 'hidden_dim': 1024, 'num_layers': 24, 'num_heads': 16, 'intermediate_dim': 4096, 'intermediate_activation': 'quick_gelu', 'intermediate_output_index': None, 'image_shape': [336, 336, 3]}, 'registered_name': 'keras_hub>CLIPVisionEncoder'}
-* **text_encoder:** {'module': 'keras_hub.src.models.clip.clip_text_encoder', 'class_name': 'CLIPTextEncoder', 'config': {'name': 'clip_text_encoder', 'trainable': True, 'vocabulary_size': 49408, 'embedding_dim': 768, 'hidden_dim': 768, 'num_layers': 12, 'num_heads': 12, 'intermediate_dim': 3072, 'intermediate_activation': 'quick_gelu', 'intermediate_output_index': None, 'max_sequence_length': 77}, 'registered_name': 'keras_hub>CLIPTextEncoder'}
-* **projection_dim:** 768
-This model card has been generated automatically and should be completed by the model author. See [Model Cards documentation](https://huggingface.co/docs/hub/model-cards) for more information.

+# Model Overview
+## Model Summary
+This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural language descriptions paired with images found online. It has been trained on a large dataset of image-text pairs, which makes it suitable for tasks such as zero-shot image classification and image search based on text queries. Images and text are mapped into a shared embedding space, so an image can be compared directly against candidate text descriptions.
+Weights are released under the [MIT License](https://opensource.org/license/mit). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).
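The contrastive objective named in the summary can be illustrated with a small, dependency-free sketch. It assumes a square matrix of scaled similarity scores in which entry `[i][j]` compares image `i` with text `j`, with matching pairs on the diagonal. The function names are hypothetical and this is not the KerasCV training code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(logits):
    """Symmetric cross-entropy over an n x n image-text similarity matrix.

    logits[i][j] is the scaled similarity of image i and text j.
    The matching text for image i is assumed to sit at column i.
    """
    n = len(logits)
    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


aligned = clip_contrastive_loss([[10.0, 0.0], [0.0, 10.0]])
misaligned = clip_contrastive_loss([[0.0, 10.0], [10.0, 0.0]])
print(aligned < misaligned)  # matching pairs on the diagonal give the lower loss
```

Minimizing this loss pushes matching image-text pairs toward high similarity and mismatched pairs toward low similarity, which is what places both modalities in one embedding space.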
+## Links
+* [CLIP Quickstart Notebook](https://www.kaggle.com/code/divyasss/clip-quickstart-single-shot-classification)
+* [CLIP API Documentation](https://keras.io/api/keras_cv/models/clip/)
+* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
+## Installation
+Keras and KerasCV can be installed with:
+```
+pip install -U -q keras-cv
+pip install -U -q "keras>=3"
+```
+JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
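Because Keras 3 can run on any of these backends, it may help to choose one explicitly. A minimal sketch, assuming the `KERAS_BACKEND` environment variable is used for selection and that JAX is the installed backend (an arbitrary choice for illustration):

```python
import os

# Keras reads KERAS_BACKEND when it is first imported, so set it beforehand.
# Accepted values include "jax", "tensorflow", and "torch".
os.environ["KERAS_BACKEND"] = "jax"

# Import Keras only after the variable is set, for example:
# import keras
```

The backend cannot be switched after Keras has been imported, so in a notebook this cell belongs before any Keras import.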
+## Presets
+The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
+| Preset name                | Parameters | Description                                                                                                                                                                                                                                                                                                       |
+|----------------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| clip-vit-base-patch16      | 149.62M    | The model uses a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224) |
+| clip-vit-base-patch32      | 151.28M    | The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224) |
+| clip-vit-large-patch14     | 427.62M    | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224) |
+| clip-vit-large-patch14-336 | 427.94M    | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336) |
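The patch size and input size in the table determine how many image patches the vision encoder processes. The arithmetic can be sketched as follows, assuming square inputs and non-overlapping patches. The helper `num_patches` is hypothetical and not part of KerasCV.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches cut from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# (image size, patch size) pairs taken from the preset table above.
presets = {
    "clip-vit-base-patch16": (224, 16),
    "clip-vit-base-patch32": (224, 32),
    "clip-vit-large-patch14": (224, 14),
    "clip-vit-large-patch14-336": (336, 14),
}

for name, (image_size, patch_size) in presets.items():
    print(f"{name}: {num_patches(image_size, patch_size)} patches")
# clip-vit-base-patch16: 196 patches
# clip-vit-base-patch32: 49 patches
# clip-vit-large-patch14: 256 patches
# clip-vit-large-patch14-336: 576 patches
```

The 336-pixel preset sees 320 more patches than the 224-pixel ViT-L/14 preset. With the hidden dimension of 1024 listed in the previous model config, that corresponds to roughly 0.33M additional position-embedding parameters, which is in line with the small gap between the two parameter counts in the table.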
+## Example code
+```
+import keras
+from keras import ops
+from keras_cv.models import CLIP
+from keras_cv.models.feature_extractor.clip import CLIPProcessor
+
+
+# Optional helper: resize, crop, and normalize an image file for CLIP.
+def transform_image(image_path, input_resolution):
+    # Per-channel RGB mean and standard deviation used for normalization.
+    mean = ops.array([0.48145466, 0.4578275, 0.40821073])
+    std = ops.array([0.26862954, 0.26130258, 0.27577711])
+    image = keras.utils.load_img(image_path)
+    image = keras.utils.img_to_array(image)
+    # Resize to a square and divide pixel values by 255.
+    image = (
+        ops.image.resize(
+            image,
+            (input_resolution, input_resolution),
+            interpolation="bicubic",
+        )
+        / 255.0
+    )
+    # Central crop. Arrays are (height, width, channels). After the square
+    # resize above, central_fraction is 1.0 and the crop keeps the full image.
+    height, width = image.shape[0], image.shape[1]
+    central_fraction = input_resolution / height
+    left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
+    top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
+    right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
+    bottom = ops.cast(
+        (height + height * central_fraction) / 2, dtype="int32"
+    )
+    image = ops.slice(
+        image, [top, left, 0], [bottom - top, right - left, 3]
+    )
+    image = (image - mean) / std
+    # Add a batch dimension: (1, height, width, 3).
+    return ops.expand_dims(image, axis=0)
+
+
+# "vocab.json" and "merges.txt" are the tokenizer vocabulary and merges files.
+processor = CLIPProcessor("vocab.json", "merges.txt")
+processed_image = transform_image("cat.jpg", 224)
+tokens = processor(["mountains", "cat on tortoise", "house"])
+model = CLIP.from_preset("clip-vit-base-patch32")
+output = model({
+                "images": processed_image,
+                "token_ids": tokens['token_ids'],
+                "padding_mask": tokens['padding_mask']})
+```
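To show how image-text similarities become zero-shot class probabilities, here is a small dependency-free sketch. It works on toy embedding vectors rather than on the model output above, whose exact structure is not described in this card. The function names and the logit scale of 100 are assumptions for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Probability of each text prompt given one image embedding.

    logit_scale is an assumed temperature; a trained model supplies its own.
    """
    image_unit = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text_unit = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy 3-dimensional embeddings standing in for real encoder outputs.
labels = ["mountains", "cat on tortoise", "house"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)  # cat on tortoise
```

The prompt whose embedding points in the direction closest to the image embedding receives the highest probability, which is the basis of zero-shot classification with CLIP.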