MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 on both CPU and GPU

#3
by Kinfai - opened

I used the following code to measure inference time. MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 model on both CPU (12th Gen Intel(R) Core(TM) i7-12700K) and GPU (NVIDIA GeForce RTX 3090). Is this expected?

device = "cuda:0"
# device = "cpu"

model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s2',
                                                              pretrained='checkpoints/mobileclip_s2.pt',
                                                              device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
    infer_t += start.elapsed_time(end)
    # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')

import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-256',
                                                             pretrained='datacomp_s34b_b86k',
                                                             device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
    infer_t += start.elapsed_time(end)
    # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')

mobileclip_s2
torch_gpu: 18.91 ms/frame
torch_cpu: 170.62 ms/frame

openclip_vit_b_32_256
torch_gpu: 6.31 ms/frame
torch_cpu: 114.05 ms/frame
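
For the CPU runs, the commented-out wall-clock timing is the relevant path (CUDA events time GPU streams). Spelled out, it looks roughly like this; a sketch reusing the model and image objects created above, with time.perf_counter in place of time.time for better resolution:

import time

infer_t = 0.0
with torch.no_grad():
    for _ in range(1000):
        start_t = time.perf_counter()
        image_features = model.encode_image(image)
        end_t = time.perf_counter()
        infer_t += (end_t - start_t) * 1000.0  # seconds -> milliseconds
print(f'inference speed: {infer_t / 1000} ms/frame')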

Any updates on this? I see a similar gap with both the torch/CUDA backend and the ONNX Runtime execution providers.
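
For context, a minimal sketch of such an ONNX Runtime comparison (the ImageEncoder wrapper, the 256x256 input shape, the file name, and the opset are assumptions, and it presumes encode_image traces cleanly through torch.onnx.export):

import time
import numpy as np
import onnxruntime as ort
import torch

# Hypothetical wrapper so only the image-encoding path gets exported.
class ImageEncoder(torch.nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        return self.clip_model.encode_image(x)

encoder = ImageEncoder(model).eval().cpu()   # `model` as created above
dummy = torch.randn(1, 3, 256, 256)          # assumed MobileCLIP-S2 input resolution
torch.onnx.export(encoder, dummy, "image_encoder.onnx",
                  input_names=["image"], output_names=["features"],
                  opset_version=17)

sess = ort.InferenceSession("image_encoder.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
x = dummy.numpy().astype(np.float32)

for _ in range(10):                          # warm-up, not timed
    sess.run(None, {"image": x})
t0 = time.perf_counter()
for _ in range(1000):
    sess.run(None, {"image": x})
elapsed_ms = (time.perf_counter() - t0) * 1000.0
print(f'{elapsed_ms / 1000:.2f} ms/frame')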

Apologies for the delayed response. We benchmarked our models on the Neural Engine of an iPhone 12 Pro Max using Core ML. For optimal performance on NVIDIA GPUs, I recommend using TensorRT, as its kernels appear to be better optimized for depthwise/grouped convolutions.
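
Two possible starting points for the TensorRT route, assuming an ONNX export of the image encoder like the one sketched above (the file name and flags are illustrative, not an official MobileCLIP recipe):

# Option 1: build and benchmark an FP16 TensorRT engine with trtexec,
# which also reports per-inference GPU compute time:
#
#   trtexec --onnx=image_encoder.onnx --fp16 --saveEngine=image_encoder.engine
#
# Option 2: let ONNX Runtime delegate to TensorRT via its execution provider
# (requires an onnxruntime-gpu build with TensorRT support):
import onnxruntime as ort

sess = ort.InferenceSession(
    "image_encoder.onnx",
    providers=["TensorrtExecutionProvider",   # try TensorRT first
               "CUDAExecutionProvider",       # then plain CUDA
               "CPUExecutionProvider"],       # then CPU
)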
