---
license: apache-2.0
language:
- en
metrics:
- code_eval
library_name: transformers
pipeline_tag: image-to-text
tags:
- text-generation-inference
---
|
<u><b>We are creating a spatially aware vision-language (VL) model.</b></u>

This model is trained on COCO dataset images, augmented with extra information describing the spatial relationships between the entities in each image.

This is a sequence-to-sequence model for image captioning. The architecture is a <u><b>ViT encoder with a GPT-2 decoder.</b></u>
|
|
|
<details>
<summary>Requirements!</summary>

- 4 GB GPU RAM
- CUDA-enabled Docker

</details>
|
|
|
To download and run the model:

```python
import torch
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the image-to-text captioning pipeline for this model
image_captioner = pipeline("image-to-text", model="voxreality/rgb-language_cap", max_new_tokens=200, device=device)

filename = 'path/to/file'

generated_captions = image_captioner(filename)
print(generated_captions)
```
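Alternatively, since the architecture is a ViT encoder with a GPT-2 decoder, the checkpoint can also be loaded directly through the `VisionEncoderDecoderModel` interface instead of the pipeline helper. The snippet below is a minimal sketch, assuming the repository ships the usual image-processor and tokenizer configs alongside the weights:

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_name = "voxreality/rgb-language_cap"

# Load the encoder-decoder model and its preprocessing components
model = VisionEncoderDecoderModel.from_pretrained(model_name)
image_processor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

# Preprocess an image and generate a spatial caption
image = Image.open('path/to/file').convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.to(device)
output_ids = model.generate(pixel_values, max_new_tokens=200)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```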
|
|
|
The model is trained to generate as much text as possible, up to a maximum of 200 tokens, which corresponds to roughly 5 sentences; a 6th sentence is usually cut off.

<i>The output always has the form: "Object1" is to the "Left/Right/etc." of "Object2".</i>
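Because the captions follow this fixed pattern, the text can be parsed back into (object, spatial relation, object) triples if needed. The snippet below is only an illustrative sketch: the example caption string and the regular expression are assumptions based on the format described above, not part of the model itself.

```python
import re

# Illustrative caption in the format described above (not real model output)
caption = "the chair is to the left of the table. the lamp is to the right of the sofa."

# Assumed pattern: "<object1> is to the <relation> of <object2>"
pattern = re.compile(r"(?:the )?(.+?) is to the (.+?) of (?:the )?(.+)")

triples = []
for sentence in caption.split('.'):
    sentence = sentence.strip()
    match = pattern.match(sentence)
    if match:
        triples.append(tuple(part.strip() for part in match.groups()))

print(triples)  # [('chair', 'left', 'table'), ('lamp', 'right', 'sofa')]
```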
|
|
|
## If you want to produce a specific number of captions (up to 5)
|
```python
def print_up_to_n_sentences(captions, n):
    # Keep only the first n sentences of the generated caption text
    result = ''
    for caption in captions:
        generated_text = caption.get('generated_text', '')
        sentences = generated_text.split('.')
        result = '.'.join(sentences[:n])
    return result

filename = 'path/to/file'

generated_captions = image_captioner(filename)
captions = print_up_to_n_sentences(generated_captions, 5)
print(captions)
```
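The helper above can also be reused over several images. A small usage sketch follows; the file paths are placeholders, and `image_captioner` and `print_up_to_n_sentences` are assumed to be defined as in the previous snippets:

```python
# Placeholder paths; replace with your own images
filenames = ['path/to/file1', 'path/to/file2']

for name in filenames:
    generated_captions = image_captioner(name)
    # Keep only the first 3 sentences of each caption
    print(name, '->', print_up_to_n_sentences(generated_captions, 3))
```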
|
|