We are creating a spatially aware vision-language (VL) model.

The model is trained on COCO dataset images augmented with extra information about the spatial relationships between the entities in each image.

This is a sequence-to-sequence model for image captioning. The architecture is a ViT encoder with a GPT-2 decoder.

Requirements:
- 4 GB of GPU RAM
- CUDA-enabled Docker

To download and run the model:

import torch
from transformers import pipeline

# Use the first GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
image_captioner = pipeline("image-to-text", model="voxreality/rgb-language_cap", max_new_tokens=200, device=device)

filename = 'path/to/file'  # path to the input image
generated_captions = image_captioner(filename)
print(generated_captions)

The model is trained to produce as many words as possible, up to a maximum of 200 tokens. This corresponds to roughly 5 sentences; a 6th sentence is usually cut off mid-way.
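Because the last sentence is often cut off at the token limit, it can help to discard the incomplete tail before using the text. The following is a minimal sketch, not part of the model's API: the helper name `drop_truncated_tail` is hypothetical, and it assumes that every complete sentence ends with a period and that object names contain no periods.

```python
def drop_truncated_tail(text: str) -> str:
    """Keep everything up to and including the last period (assumed sentence end)."""
    last_period = text.rfind(".")
    if last_period == -1:
        # No period at all: there is no complete sentence to keep.
        return ""
    return text[: last_period + 1].strip()


# Hypothetical caption text, made up for illustration only.
example = "The chair is to the left of the table. The lamp is to the ri"
print(drop_truncated_tail(example))
```

In practice the input would be the `generated_text` field of each item returned by the pipeline.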

The output always has the following form: "Object1" is to the "Left/Right etc." of the "Object2".
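Since the output follows a fixed template, it can be turned into (object, relation, object) triples with a regular expression. The sketch below is an illustration under assumptions: the function name `parse_relations` is hypothetical, the exact wording of real captions (articles, quoting, capitalization) may differ from the template, and sentences that do not match, such as a truncated last sentence, are silently skipped.

```python
import re
from typing import List, Tuple

# Assumed sentence template: <object1> is to the <relation> of the <object2>
# Quotes around each slot are treated as optional.
SENTENCE = re.compile(
    r'"?(.+?)"?\s+is to the\s+"?(.+?)"?\s+of the\s+"?(.+?)"?',
    re.IGNORECASE,
)


def parse_relations(text: str) -> List[Tuple[str, str, str]]:
    """Extract (object1, relation, object2) triples from a caption string."""
    triples = []
    for sentence in text.split("."):
        match = SENTENCE.fullmatch(sentence.strip())
        if match:
            obj1, relation, obj2 = (part.strip() for part in match.groups())
            triples.append((obj1, relation, obj2))
    return triples


# Hypothetical caption text, made up for illustration only.
print(parse_relations("The chair is to the left of the table. The lamp is to the ri"))
```

If real captions use a different phrasing, only the `SENTENCE` pattern needs to be adjusted.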

To produce a specific number of captions (up to 5), keep only the first n sentences of the output:

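def print_up_to_n_sentences(captions, n):
    # Return the first n sentences of every caption as a list of strings.
    results = []
    for caption in captions:
        generated_text = caption.get('generated_text', '')
        # Split on periods and drop empty fragments.
        sentences = [s.strip() for s in generated_text.split('.') if s.strip()]
        results.append('. '.join(sentences[:n]) + '.' if sentences else '')
    return results

filename = 'path/to/file'  # path to the input image

# image_captioner is the pipeline created in the first example.
generated_captions = image_captioner(filename)
captions = print_up_to_n_sentences(generated_captions, 5)
print(captions)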