---
license: apache-2.0
language:
- en
metrics:
- code_eval
library_name: transformers
pipeline_tag: image-to-text
tags:
- text-generation-inference
---
|
<u><b>We are creating a spatially aware vision-language (VL) model.</b></u>

This model is trained on COCO dataset images, augmented with extra information describing the spatial relationships between the entities in each image.

This is a sequence-to-sequence model for image captioning. The architecture is a <u><b>ViT encoder with a GPT-2 decoder.</b></u>
|
|
|
<details>
<summary>Requirements!</summary>

- 4 GB GPU RAM
- CUDA-enabled Docker

</details>
|
|
|
To download and run the model:

```python
import torch
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the image-to-text captioning pipeline for this model
image_captioner = pipeline("image-to-text", model="voxreality/rgb-language_cap", max_new_tokens=200, device=device)

filename = 'path/to/file'

generated_captions = image_captioner(filename)
print(generated_captions)
```
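Alternatively, since the architecture is a ViT encoder with a GPT-2 decoder, the checkpoint can also be loaded directly through the `VisionEncoderDecoderModel` interface instead of the pipeline helper. The snippet below is a minimal sketch, assuming the repository ships the usual image-processor and tokenizer configs alongside the weights:

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_name = "voxreality/rgb-language_cap"

# Load the encoder-decoder model and its preprocessing components
model = VisionEncoderDecoderModel.from_pretrained(model_name)
image_processor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

# Preprocess an image and generate a spatial caption
image = Image.open('path/to/file').convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.to(device)
output_ids = model.generate(pixel_values, max_new_tokens=200)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```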
|
|
|
The model is trained to generate as much text as possible, up to a maximum of 200 tokens, which corresponds to roughly 5 sentences; a 6th sentence is usually cut off.

<i>The output always has the form: "Object1" is to the "Left/Right/etc." of "Object2".</i>
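Because the captions follow this fixed pattern, the text can be parsed back into (object, spatial relation, object) triples if needed. The snippet below is only an illustrative sketch: the example caption string and the regular expression are assumptions based on the format described above, not part of the model itself.

```python
import re

# Illustrative caption in the format described above (not real model output)
caption = "the chair is to the left of the table. the lamp is to the right of the sofa."

# Assumed pattern: "<object1> is to the <relation> of <object2>"
pattern = re.compile(r"(?:the )?(.+?) is to the (.+?) of (?:the )?(.+)")

triples = []
for sentence in caption.split('.'):
    sentence = sentence.strip()
    match = pattern.match(sentence)
    if match:
        triples.append(tuple(part.strip() for part in match.groups()))

print(triples)  # [('chair', 'left', 'table'), ('lamp', 'right', 'sofa')]
```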
|
|
|
## If you want to produce a specific number of captions (up to 5)
|
```python
def print_up_to_n_sentences(captions, n):
    # Keep only the first n sentences of the generated caption text
    result = ''
    for caption in captions:
        generated_text = caption.get('generated_text', '')
        sentences = generated_text.split('.')
        result = '.'.join(sentences[:n])
    return result

filename = 'path/to/file'

generated_captions = image_captioner(filename)
captions = print_up_to_n_sentences(generated_captions, 5)
print(captions)
```
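The helper above can also be reused over several images. A small usage sketch follows; the file paths are placeholders, and `image_captioner` and `print_up_to_n_sentences` are assumed to be defined as in the previous snippets:

```python
# Placeholder paths; replace with your own images
filenames = ['path/to/file1', 'path/to/file2']

for name in filenames:
    generated_captions = image_captioner(name)
    # Keep only the first 3 sentences of each caption
    print(name, '->', print_up_to_n_sentences(generated_captions, 3))
```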
|
|