Error converting Salesforce/blip-image-captioning-base
Hello,
I am very new to Hugging Face and machine learning in general. I understand that the BLIP model is not yet supported for conversion to Core ML. Is there a way I can write my own conversion code?
Thanks
Conversion Settings:
Model: Salesforce/blip-image-captioning-base
Task: None
Framework: None
Compute Units: None
Precision: None
Tolerance: None
Push to: None
Error: "blip is not supported yet. Only ['bart', 'beit', 'bert', 'big_bird', 'bigbird_pegasus', 'blenderbot', 'blenderbot_small', 'bloom', 'convnext', 'ctrl', 'cvt', 'data2vec', 'distilbert', 'ernie', 'gpt2', 'gpt_neo', 'levit', 'm2m_100', 'marian', 'mobilebert', 'mobilevit', 'mvp', 'pegasus', 'plbart', 'roberta', 'roformer', 'segformer', 'splinter', 'squeezebert', 't5', 'vit', 'yolos'] are supported. If you want to support blip please propose a PR or open up an issue."
Hello @99s42m!
Thanks for reporting this! We'll take a look and see if we can add support for blip soon. Meanwhile, you could try to use coremltools directly. coremltools is a Python package created by Apple that can convert PyTorch and TensorFlow models to Core ML. This conversion Space is based on exporters, which in turn uses coremltools under the hood.
@pcuenq Thank you so much for your response.
Here's where I have gotten so far:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
img_url = 'https://images.nationalgeographic.org/image/upload/t_edhub_resource_key_image/v1638882947/EducationHub/photos/tourists-at-victoria-falls.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
# conditional image captioning
text = "The main geographical feature in this photo is a"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens = 20)
print(processor.decode(out[0], skip_special_tokens=True))
import coremltools as ct
import torch
import torchvision
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, inputs['input_ids'])
out = traced_model(example_input)
The above code throws the following error:
RuntimeError: Input type (long int) and bias type (float) should be the same
I understand that you are busy and this might be a basic question, but any help would be greatly appreciated.
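A likely cause of the error above: torch.jit.trace passes its example input positionally, and as far as I can tell the first positional parameter of BlipForConditionalGeneration.forward is pixel_values. Tracing with inputs['input_ids'] therefore sends integer token ids into the vision encoder, whose first layer is a float convolution. The sketch below reproduces that kind of dtype mismatch with a plain Conv2d standing in for the patch-embedding layer; the layer sizes and variable names are assumptions for illustration, not BLIP's actual configuration.

```python
import torch

# Hypothetical stand-in for a vision patch-embedding layer (float32 weights).
patch_embedding = torch.nn.Conv2d(3, 8, kernel_size=16, stride=16)

# float32 image tensor, similar in kind to what the processor returns as pixel_values
pixel_values = torch.rand(1, 3, 224, 224)

# int64 tensor, the same dtype as input_ids
token_like = torch.randint(0, 100, (1, 3, 224, 224))

# A float input goes through the convolution without complaint.
float_ok = patch_embedding(pixel_values).shape

# An integer input is rejected because it does not match the float parameters.
try:
    patch_embedding(token_like)
    dtype_error = None
except RuntimeError as err:
    dtype_error = str(err)

print(float_ok)
print(dtype_error)
```

If this is indeed the cause, tracing with inputs['pixel_values'] as the first example input (and inputs['input_ids'] as the second, if the text is needed) should get past this particular error. Tracing may then hit other issues, such as the model returning a dictionary-like output, which loading the model with torchscript=True is meant to address.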