MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 on both CPU and GPU

#3
by Kinfai - opened

I used the following code to measure inference time. MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 model on both CPU (12th Gen Intel(R) Core(TM) i7-12700K) and GPU (NVIDIA GeForce RTX 3090). Is this expected?

device = "cuda:0"
# device = "cpu"

model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s2',
                                                              pretrained='checkpoints/mobileclip_s2.pt',
                                                              device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
    infer_t += start.elapsed_time(end)
    # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')

import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-256',
                                                             pretrained='datacomp_s34b_b86k',
                                                             device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
    infer_t += start.elapsed_time(end)
    # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')

mobileclip_s2
torch_gpu: 18.91 ms/frame
torch_cpu: 170.62 ms/frame

openclip_vit_b_32_256
torch_gpu: 6.31 ms/frame
torch_cpu: 114.05 ms/frame
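
For the CPU runs, the commented-out wall-clock timing is the relevant path (CUDA events time GPU streams). Spelled out, it looks roughly like this; a sketch reusing the model and image objects created above, with time.perf_counter in place of time.time for better resolution:

import time

infer_t = 0.0
with torch.no_grad():
    for _ in range(1000):
        start_t = time.perf_counter()
        image_features = model.encode_image(image)
        end_t = time.perf_counter()
        infer_t += (end_t - start_t) * 1000.0  # seconds -> milliseconds
print(f'inference speed: {infer_t / 1000} ms/frame')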

Any updates on this? I see a similar gap with both the torch/CUDA backend and the ONNX Runtime execution providers.
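
For context, a minimal sketch of such an ONNX Runtime comparison (the ImageEncoder wrapper, the 256x256 input shape, the file name, and the opset are assumptions, and it presumes encode_image traces cleanly through torch.onnx.export):

import time
import numpy as np
import onnxruntime as ort
import torch

# Hypothetical wrapper so only the image-encoding path gets exported.
class ImageEncoder(torch.nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        return self.clip_model.encode_image(x)

encoder = ImageEncoder(model).eval().cpu()   # `model` as created above
dummy = torch.randn(1, 3, 256, 256)          # assumed MobileCLIP-S2 input resolution
torch.onnx.export(encoder, dummy, "image_encoder.onnx",
                  input_names=["image"], output_names=["features"],
                  opset_version=17)

sess = ort.InferenceSession("image_encoder.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
x = dummy.numpy().astype(np.float32)

for _ in range(10):                          # warm-up, not timed
    sess.run(None, {"image": x})
t0 = time.perf_counter()
for _ in range(1000):
    sess.run(None, {"image": x})
elapsed_ms = (time.perf_counter() - t0) * 1000.0
print(f'{elapsed_ms / 1000:.2f} ms/frame')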

Apologies for the delayed response. We benchmarked our models on the Neural Engine of an iPhone 12 Pro Max using Core ML. For optimal performance on NVIDIA GPUs, I recommend using TensorRT, as its kernels appear to be better optimized for depthwise/grouped convolutions.
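
Two possible starting points for the TensorRT route, assuming an ONNX export of the image encoder like the one sketched above (the file name and flags are illustrative, not an official MobileCLIP recipe):

# Option 1: build and benchmark an FP16 TensorRT engine with trtexec,
# which also reports per-inference GPU compute time:
#
#   trtexec --onnx=image_encoder.onnx --fp16 --saveEngine=image_encoder.engine
#
# Option 2: let ONNX Runtime delegate to TensorRT via its execution provider
# (requires an onnxruntime-gpu build with TensorRT support):
import onnxruntime as ort

sess = ort.InferenceSession(
    "image_encoder.onnx",
    providers=["TensorrtExecutionProvider",   # try TensorRT first
               "CUDAExecutionProvider",       # then plain CUDA
               "CPUExecutionProvider"],       # then CPU
)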
