Visual Question Answering and Image Captioning using BLIP and OpenVINO

BLIP is a pre-training framework for unified vision-language understanding and generation that achieves state-of-the-art results on a wide range of vision-language tasks. This tutorial shows how to use BLIP for visual question answering and image captioning.

The complete pipeline of this demo is shown below:

Image Captioning

The following image shows an example of the input image and the generated caption:
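For reference, here is a minimal sketch of BLIP captioning with the Hugging Face transformers API. The checkpoint name and the "demo.jpg" path are illustrative assumptions; the notebook may use a different checkpoint and its own sample image.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the captioning variant of BLIP (assumed checkpoint, for illustration).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works here; "demo.jpg" is a placeholder path.
image = Image.open("demo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Autoregressively generate a caption and decode it to text.
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```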

Visual Question Answering

The following image shows an example of the input image, the question, and the answer generated by the model:
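The question-answering variant of BLIP follows the same pattern, with the question passed to the processor alongside the image. A minimal sketch, again with an assumed checkpoint and placeholder image path:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the VQA variant of BLIP (assumed checkpoint, for illustration).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("demo.jpg").convert("RGB")
question = "What is in the picture?"

# The processor tokenizes the question and preprocesses the image together.
inputs = processor(images=image, text=question, return_tensors="pt")

# Generate the answer tokens and decode them to text.
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```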

Notebook Contents

This folder contains a notebook that shows how to convert and optimize the model with OpenVINO. The tutorial consists of the following parts:

  1. Instantiate a BLIP model.
  2. Convert the BLIP model to OpenVINO IR (see the sketch after this list).
  3. Run visual question answering and image captioning with OpenVINO.
  4. Optimize the BLIP model using NNCF.
  5. Compare the original and optimized models.
  6. Launch an interactive demo.
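As a rough preview of the conversion and optimization steps, the sketch below traces one BLIP submodel to OpenVINO IR with ov.convert_model and applies NNCF post-training quantization. The checkpoint name, the 384x384 input shape, and the one-sample calibration set are assumptions made to keep the sketch self-contained; a full pipeline would also convert the text encoder and text decoder and calibrate on representative images.

```python
import torch
import openvino as ov
import nncf
from transformers import BlipForConditionalGeneration

# Assumed checkpoint, for illustration only.
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# BLIP consists of several submodels; only the vision model is converted here.
vision_model = model.vision_model
vision_model.eval()

# Trace the PyTorch module into an OpenVINO model using a dummy input
# (BLIP-base is assumed to take 384x384 images).
example_input = torch.zeros((1, 3, 384, 384))
ov_vision = ov.convert_model(vision_model, example_input=example_input)
ov.save_model(ov_vision, "blip_vision_model.xml")

# Optimize with NNCF post-training quantization. A real calibration set
# should contain preprocessed images; a single dummy tensor is used here
# purely to keep the sketch runnable.
calibration_data = nncf.Dataset([example_input.numpy()])
quantized_vision = nncf.quantize(ov_vision, calibration_data)
ov.save_model(quantized_vision, "blip_vision_model_int8.xml")

# Compile and run the optimized model on CPU.
core = ov.Core()
compiled = core.compile_model(quantized_vision, "CPU")
result = compiled(example_input.numpy())
```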

Installation Instructions

This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.