|
--- |
|
license: mit |
|
library_name: open_clip |
|
pipeline_tag: zero-shot-image-classification |
|
--- |
|
|
|
CoreML versions of [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](/laion/CLIP-ViT-H-14-laion2B-s32B-b79K). |
|
|
|
On my baseline M1 they run about 4x faster than the equivalent PyTorch models on the `mps` device (~6 image embeddings per second vs ~1.5 for torch+mps), while drawing about 3/4 of the power according to `asitop` profiling (6W average vs 8W for torch+mps).
|
|
|
There are separate models for the image and text encoders. Sorry, I don't know how to put them both into one file. |
|
|
|
Conversion code is in [clip-to-coreml.ipynb](clip-to-coreml.ipynb). |
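
If you just want a feel for what the conversion involves, the rough shape is to wrap each encoder in a small `torch.nn.Module`, trace it, and hand the trace to `coremltools`. The sketch below is illustrative only: the wrapper class, example input shape, and input/output names are my assumptions, not the exact code from the notebook.

```python
import torch
import coremltools as ct
from transformers import CLIPModel

clip = CLIPModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K").eval()

class ImageEncoder(torch.nn.Module):
    """Thin wrapper so the traced graph maps pixel values straight to image embeddings."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, pixel_values):
        return self.clip_model.get_image_features(pixel_values=pixel_values)

example_input = torch.rand(1, 3, 224, 224)  # ViT-H-14 expects 224x224 RGB crops
traced = torch.jit.trace(ImageEncoder(clip), example_input)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_image_preprocessed", shape=example_input.shape)],
    outputs=[ct.TensorType(name="output_embedding")],
    convert_to="mlprogram",
)

# coremltools saves ML Programs as .mlpackage directories; the model files in
# this repo use a .mlprogram suffix.
mlmodel.save("CLIP-ViT-H-14-laion2B-s32B-b79K.image-encoder.mlpackage")
```

The text encoder is handled the same way, with a wrapper around `get_text_features` and an integer token-id input.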
|
|
|
# Usage |
|
|
|
You'll need to use the original CLIP preprocessor (or write your own preprocessing), e.g.:
|
|
|
```python
|
from transformers import CLIPProcessor |
|
import coremltools as ct |
|
from PIL import Image |
|
|
|
preprocessor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K") |
|
|
|
model_coreml_image = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.image-encoder.mlprogram') |
|
model_coreml_text = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.text-encoder.mlprogram') |
|
|
|
image = Image.open("example.jpg") |
|
preprocessed_image = preprocessor(text=None, images=image, return_tensors="pt")

# Core ML takes numpy arrays, so convert the torch tensor before predicting
image_embedding = model_coreml_image.predict({'input_image_preprocessed': preprocessed_image.pixel_values.numpy()})['output_embedding']
|
|
|
text = 'example text' |
|
preprocessed_text = preprocessor(text=text, images=None, return_tensors="pt", padding=True) |
|
# Convert the token ids to numpy as well (depending on how the model was converted,
# an explicit cast such as .astype('int32') may be needed)
text_embedding = model_coreml_text.predict({'input_text_token_ids': preprocessed_text.input_ids.numpy()})['output_embedding']
|
``` |
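
The embeddings come back as numpy arrays (1024-dimensional for ViT-H-14). For zero-shot classification you compare them with cosine similarity; here is a minimal sketch reusing the variables from the example above:

```python
import numpy as np

def normalize(v):
    # L2-normalise along the last axis so the dot product below is cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Cosine similarity between the image and text embeddings computed above.
# With several candidate texts, the highest-scoring one is the predicted label.
similarity = normalize(np.asarray(image_embedding)) @ normalize(np.asarray(text_embedding)).T
print(similarity)
```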
|
|
|
Please credit me if you use this. |
|
|
|
|
|