Potential ways to accelerate image-to-text tasks?
#12 by triscuiter - opened
I have been trying to use this model for image-to-text tasks, and so far it produces better quality than alternatives such as BLIP-2 in terms of the level of detail, accuracy, and ability to summarize context. The problem is that it is too slow: a single image (resized so width/height <= 300) takes ~20-30 s on average. Is there any way to speed up the inference process? I have tried a single A10G with 24 GB of memory, and also 4 A10G cards, but the processing time barely changes.