We are creating a spatially aware vision-language (VL) model.

The model is trained on COCO dataset images, augmented with extra information about the spatial relationships between the entities in each image.

This is a sequence-to-sequence model for visual question answering. The architecture is BLIP (BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation).

Requirements:
- 4 GB of GPU RAM
- CUDA-enabled Docker
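The GPU requirement can be checked before loading the model. The sketch below is only an illustration: it assumes the 4 GB figure refers to total device memory and reads it as 4 x 10^9 bytes, and the helper names `meets_requirement` and `detect_gpu_memory_bytes` are hypothetical, not part of this repository.

```python
from typing import Optional

# Assumption: "4GB GPU RAM" is read as 4 * 10**9 bytes of total device memory.
REQUIRED_BYTES = 4 * 10**9


def meets_requirement(total_bytes: Optional[int], required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the reported GPU memory covers the requirement."""
    if total_bytes is None:
        return False
    return total_bytes >= required_bytes


def detect_gpu_memory_bytes() -> Optional[int]:
    """Return the total memory of CUDA device 0 in bytes, or None if unavailable."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    total = detect_gpu_memory_bytes()
    print("GPU requirement met:", meets_requirement(total))
```

The check reports total memory, not free memory, so other processes on the same GPU can still cause out-of-memory errors.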

To download and run the model:

from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image
# Use the GPU when available; fall back to full precision on CPU,
# since the float16 weights are intended for CUDA devices
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == "cuda" else torch.float32
# Specify the model ID (or the path to the directory where the model was saved)
model_path = "voxeality/rgb-language_vqa"
# Load the model
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, dtype)
question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object"
image_path = 'path/to/file'
image = Image.open(image_path).convert("RGB")

# Load the processor used during training for consistent preprocessing
processor = BlipProcessor.from_pretrained(model_path)
# Prepare inputs on the same device and dtype as the model
encoding = processor(image, question, return_tensors="pt").to(device, dtype)

# Generate the answer and decode it to text
out = model.generate(**encoding, max_new_tokens=200)
generated_text = processor.decode(out[0], skip_special_tokens=True)
print(generated_text)

The model is trained to produce a spatial answer to any question about the spatial relationships between objects in the image.

Each question-answer exchange takes one of the following forms:

Q. Where is "Object1"? A. to the "Left/Right etc." of another "Object2".


Q. What is below the "Object1"? A. an "Object2".
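The two question forms above can be generated from templates, and a "where" answer can be split into a relation and a reference object. The following is a minimal sketch: the exact wording the model was trained on is not documented here, so the templates, the regular expression, and the helper names `where_question`, `what_question`, and `parse_where_answer` are all assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# Relations named in the question forms; the exact training phrasing is an assumption.
RELATIONS = ("left", "right", "above", "below", "in front", "behind")

# Matches answers such as "to the left of another chair" or "above the table".
_ANSWER_RE = re.compile(
    r"^(?:to the )?(left|right|above|below|in front|behind)(?: of)?\s+"
    r"(?:(?:another|the|an|a)\s+)?(.+?)\.?$",
    re.IGNORECASE,
)


def where_question(obj: str) -> str:
    """Build a question of the form: Where is the <object>?"""
    return f"Where is the {obj}?"


def what_question(relation: str, obj: str) -> str:
    """Build a question of the form: What is <relation> the <object>?"""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation!r}")
    if relation in ("left", "right"):
        phrase = f"to the {relation} of"
    elif relation == "in front":
        phrase = "in front of"
    else:
        phrase = relation
    return f"What is {phrase} the {obj}?"


def parse_where_answer(answer: str) -> Optional[Tuple[str, str]]:
    """Split a 'where' answer into (relation, reference object), or return None."""
    match = _ANSWER_RE.match(answer.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)
```

For example, `what_question("below", "table")` yields a string that can be passed as `question` to the processor, and `parse_where_answer` can be applied to the decoded model output. Answers that do not follow the assumed pattern return `None` and should be handled as free text.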

