RuntimeError when using onnxruntime

#12 opened by EdisonEx33

Environment:
onnx 1.17.0
onnxconverter-common 1.14.0
onnxruntime-gpu 1.20.1
skl2onnx 1.17.0
tf2onnx 1.16.1
CUDA Version: 12.3

Hello, I want to use jina-clip-v2 via ONNX Runtime.
However, when I try to execute the example code, I get a RuntimeError:

RuntimeError: Input must be a list of dictionaries or a single numpy array for input 'pixel_values'.

How can I solve this?

Jina AI org

Hey @EdisonEx33 ! Can you share a code snippet? And the error trace?

Thanks for your reply!

By adding images = [Image.open(requests.get(image_url, stream=True).raw) for image_url in image_urls] and pixel_values = np.array(pixel_values), and by replacing /share/model/jina-clip-v2/onnx/model.onnx with /share/model/jina-clip-v2/onnx/model_fp16.onnx, I can now run jina-clip-v2.

# !pip install transformers onnxruntime pillow
import onnxruntime as ort
from transformers import AutoImageProcessor, AutoTokenizer
import numpy as np
from PIL import Image
import requests
# Load tokenizer and image processor using transformers
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(
    'jinaai/jina-clip-v2', trust_remote_code=True
)

# Corpus
sentences = [
    'غروب جميل على الشاطئ', # Arabic
]

# Public image URLs or PIL Images
image_urls = ['https://i.ibb.co/nQNGqL0/beach1.jpg']
# Load images from url
images = [Image.open(requests.get(image_url, stream=True).raw) for image_url in image_urls]

# Tokenize input texts and transform input images
input_ids = tokenizer(sentences, return_tensors='np')['input_ids']
pixel_values = image_processor(images)['pixel_values']
# The processor may return a list of arrays; ORT needs a single numpy array
pixel_values = np.array(pixel_values)
print(pixel_values.shape)

# Start an ONNX Runtime session (onnxruntime-gpu requires explicit providers)
session = ort.InferenceSession(
    '/share/model/jina-clip-v2/onnx/model_fp16.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)

# Run inference and time it
from time import time

t0 = time()
output = session.run(None, {'input_ids': input_ids, 'pixel_values': pixel_values})
t1 = time()

print(f"Inference took {t1 - t0:.3f}s")

# Keep the normalised embeddings; the first two outputs are un-normalised
_, _, text_embeddings, image_embeddings = output
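
Note: session.get_inputs() and session.get_outputs() show the names, types, and shapes the graph actually expects. This is a generic ONNX Runtime check, not specific to jina-clip-v2, but it is useful for the fp16 export: if an input is declared as tensor(float16), the pixel values may also need an explicit cast.

# Inspect the graph's declared inputs and outputs (names vary per export)
for inp in session.get_inputs():
    print('input:', inp.name, inp.type, inp.shape)
for out in session.get_outputs():
    print('output:', out.name, out.type, out.shape)

# If an input is declared as tensor(float16), cast before running, e.g.:
# pixel_values = pixel_values.astype(np.float16)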

However, if I only want to get image embeddings, how can I modify the code?
Simply running output = session.run(None, {'pixel_values': pixel_values}) raises an error:

Traceback (most recent call last):
  File "test_jina_clip.py", line 37, in <module>
    output = session.run(None, {'pixel_values': pixel_values})
  File "~/miniconda3/envs/hy_onnx/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 262, in run
    self._validate_input(list(input_feed.keys()))
  File "~/miniconda3/envs/hy_onnx/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 244, in _validate_input
    raise ValueError(
ValueError: Required inputs (['input_ids']) are missing from input feed (['pixel_values']).
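
A possible workaround, sketched under the assumption that this combined export always requires both inputs (which is what the error above indicates): feed a minimal dummy text batch and keep only the image embeddings. This wastes a small text forward pass; avoiding it entirely would need a vision-only export of the model.

# The graph requires input_ids, so feed a trivial dummy text batch
dummy_ids = tokenizer([''], return_tensors='np')['input_ids']

output = session.run(None, {'input_ids': dummy_ids, 'pixel_values': pixel_values})

# Assuming the same output order as in the snippet above: the last
# output is the normalised image embeddings
_, _, _, image_embeddings = output
print(image_embeddings.shape)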
