---
license: mit
library_name: open_clip
pipeline_tag: zero-shot-image-classification
---
CoreML versions of [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](/laion/CLIP-ViT-H-14-laion2B-s32B-b79K).
On my baseline M1 they run about 4x faster than the equivalent PyTorch models on the `mps` device (~6 image embeddings/sec vs ~1.5 images/sec for torch+mps), and according to `asitop` profiling they use about three quarters of the energy to do so (6 W average vs 8 W for torch+mps).
There are separate models for the image and text encoders. Sorry, I don't know how to put them both into one file.
Conversion code is in [clip-to-coreml.ipynb](clip-to-coreml.ipynb).
# Usage
You'll need to use the original CLIP preprocessor (or write your own preprocessing), e.g.:
```python
from transformers import CLIPProcessor
import coremltools as ct
from PIL import Image

preprocessor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model_coreml_image = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.image-encoder.mlprogram')
model_coreml_text = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.text-encoder.mlprogram')

# Image embedding. Ask the preprocessor for numpy tensors
# (return_tensors="np") since CoreML's predict() expects numpy
# arrays, not PyTorch tensors.
image = Image.open("example.jpg")
preprocessed_image = preprocessor(text=None, images=image, return_tensors="np", padding=True)
image_embedding = model_coreml_image.predict({'input_image_preprocessed': preprocessed_image.pixel_values})['output_embedding']

# Text embedding.
text = 'example text'
preprocessed_text = preprocessor(text=text, images=None, return_tensors="np", padding=True)
text_embedding = model_coreml_text.predict({'input_text_token_ids': preprocessed_text.input_ids})['output_embedding']
```
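Once you have both embeddings, you compare them the same way as with the PyTorch CLIP models: normalize and take the cosine similarity. A minimal numpy sketch, using small dummy vectors in place of the models' actual `output_embedding` arrays:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, pass the 'output_embedding' arrays from the two
# CoreML models. Dummy vectors here just to show the call:
image_embedding = np.array([0.1, 0.9, 0.2])
text_embedding = np.array([0.1, 0.8, 0.3])
score = cosine_similarity(image_embedding, text_embedding)
```

For zero-shot classification, compute this score against each candidate caption's text embedding and pick the highest (or softmax the scores for probabilities).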
Please credit me if you use this.