---
license: mit
library_name: open_clip
pipeline_tag: zero-shot-image-classification
---

CoreML versions of [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](/laion/CLIP-ViT-H-14-laion2B-s32B-b79K). 

On my baseline M1 they run about 4x faster than the equivalent PyTorch models on the `mps` device (~6 image embeddings per second vs ~1.5 images/sec for torch+mps), and according to `asitop` profiling they draw about 3/4 of the power while doing it (6 W average vs 8 W for torch+mps), so the energy per embedding works out considerably lower.
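
If you want to sanity-check that throughput number on your own machine, a rough timing loop like the one below is enough. This is only a sketch: it assumes an `example.jpg` next to the script and uses the input name described in the Usage section below.

```
import time
from transformers import CLIPProcessor
import coremltools as ct
from PIL import Image

# Load the image encoder and preprocess a single image once.
preprocessor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.image-encoder.mlprogram')
pixels = preprocessor(text=None, images=Image.open("example.jpg"), return_tensors="np").pixel_values

# Time repeated single-image predictions.
n = 50
start = time.perf_counter()
for _ in range(n):
    model.predict({'input_image_preprocessed': pixels})
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.1f} image embeddings/sec")
```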

There are separate models for the image and text encoders. Sorry, I don't know how to put them both into one file. 

Conversion code is in [clip-to-coreml.ipynb](clip-to-coreml.ipynb).
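
If you just want the general shape of the conversion without opening the notebook, it amounts to tracing each encoder with `torch.jit.trace` and handing the trace to `coremltools`. The sketch below is illustrative rather than a copy of the notebook: the wrapper class, shapes, and output filename are assumptions.

```
import torch
import coremltools as ct
import open_clip

# Load the original open_clip model and weights.
model, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
model.eval()

# Wrap the image tower so the trace only captures the image path.
class ImageEncoder(torch.nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, pixel_values):
        return self.clip_model.encode_image(pixel_values)

example_input = torch.rand(1, 3, 224, 224)  # ViT-H/14 takes 224x224 images
traced = torch.jit.trace(ImageEncoder(model).eval(), example_input)

# Convert the traced module to an ML Program; the text encoder is handled the same way.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name='input_image_preprocessed', shape=example_input.shape)],
    outputs=[ct.TensorType(name='output_embedding')],
    convert_to='mlprogram',
)
mlmodel.save('image-encoder.mlpackage')  # hypothetical output path
```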

# Usage

You'll need to use the original CLIP preprocessor (or write your own preprocessing), e.g.:

```
from transformers import CLIPProcessor
import coremltools as ct
from PIL import Image

# The original CLIP preprocessor handles image resizing/normalization and text tokenization.
preprocessor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

model_coreml_image = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.image-encoder.mlprogram')
model_coreml_text = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.text-encoder.mlprogram')

# Image embedding. Core ML's predict() expects numpy arrays, so ask the preprocessor for "np" tensors.
image = Image.open("example.jpg")
preprocessed_image = preprocessor(text=None, images=image, return_tensors="np", padding=True)
image_embedding = model_coreml_image.predict({'input_image_preprocessed': preprocessed_image.pixel_values})['output_embedding']

# Text embedding.
text = 'example text'
preprocessed_text = preprocessor(text=text, images=None, return_tensors="np", padding=True)
text_embedding = model_coreml_text.predict({'input_text_token_ids': preprocessed_text.input_ids})['output_embedding']
```
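
The embeddings come back as numpy arrays, so zero-shot classification is just cosine similarity between the image embedding and the embeddings of a few candidate labels. A minimal sketch continuing from the example above (the labels are made up):

```
import numpy as np

def cosine_similarity(a, b):
    # CLIP compares embeddings by cosine similarity: normalize, then dot product.
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical candidate labels; score the image against each one.
labels = ["a photo of a cat", "a photo of a dog"]
scores = []
for label in labels:
    tokens = preprocessor(text=label, images=None, return_tensors="np", padding=True)
    embedding = model_coreml_text.predict({'input_text_token_ids': tokens.input_ids})['output_embedding']
    scores.append(cosine_similarity(image_embedding, embedding))

print(labels[int(np.argmax(scores))])
```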

Please credit me if you use this.
