---
pipeline_tag: visual-question-answering
tags:
- image-captioning
- visual-question-answering
datasets:
- sbu_captions
- visual_genome
- HuggingFaceM4/VQAv2
- ChristophSchuhmann/MS_COCO_2017_URL_TEXT
language:
- en
license: apache-2.0
base_model: unum-cloud/uform-vl-english
---
<h1 align="center">UForm</h1>
<h3 align="center">
Pocket-Sized Multimodal AI<br/>
For Content Understanding and Generation<br/>
</h3>
## Description
UForm-Gen is a small generative vision-language model designed primarily for Image Captioning and Visual Question Answering. The model consists of two parts:

1. [UForm Vision Encoder](https://huggingface.co/unum-cloud/uform-vl-english)
2. [Sheared-LLaMA-1.3B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-1.3B) fine-tuned on an instruction dataset

The model was pre-trained on MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA, and a few internal datasets. UForm-Gen-Chat is the SFT version of [`UForm-Gen`](https://huggingface.co/unum-cloud/uform-gen) for multimodal chat.
### Usage

First, install the `uform` package:

```bash
pip install uform
```
For the CLI demo, run one of the following:

```bash
# CPU, full precision
uform-chat --model unum-cloud/uform-gen-chat --image_path=zebra.jpg

# GPU, half precision
uform-chat --model unum-cloud/uform-gen-chat --image_path=zebra.jpg --device="cuda:0" --fp16
```
Or if you want to use the model in your code:
```python
from PIL import Image
import torch

from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen-chat")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen-chat")

prompt = "What do you see?"
image = Image.open("zebra.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,      # greedy decoding
        use_cache=True,       # reuse the KV cache across generation steps
        max_new_tokens=128,
        eos_token_id=32001,   # end-of-message token of the chat template
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Skip the prompt tokens and decode only the generated answer.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```
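Depending on the tokenizer settings, the decoded string may still end with the end-of-message token (id 32001 above) rendered as text. A minimal cleanup sketch, assuming only that `processor.tokenizer` behaves like a standard Hugging Face tokenizer:

```python
# Hypothetical post-processing, not part of the uform API: strip the
# trailing end-of-message token (id 32001) if the decoder kept it as text.
eos_text = processor.tokenizer.decode([32001])
answer = decoded_text.removesuffix(eos_text).strip()
print(answer)
```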
## Evaluation
For captioning evaluation, we measure CLIPScore and RefCLIPScore¹.
| Model                               | Size | Caption Length | CLIPScore | RefCLIPScore |
| :---------------------------------- | ---: | -------------: | --------: | -----------: |
| `llava-hf/llava-1.5-7b-hf`          |   7B | Long           |     0.878 |        0.529 |
| `llava-hf/llava-1.5-7b-hf`          |   7B | Short          |     0.886 |        0.531 |
| `Salesforce/instructblip-vicuna-7b` |   7B | Long           |     0.902 |        0.534 |
| `Salesforce/instructblip-vicuna-7b` |   7B | Short          |     0.848 |        0.523 |
| `unum-cloud/uform-gen-chat`         | 1.5B | Long           |     0.860 |        0.525 |
| `unum-cloud/uform-gen-chat`         | 1.5B | Short          |     0.858 |        0.525 |
¹ We used the `apple/DFN5B-CLIP-ViT-H-14-378` CLIP model.
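For reference, CLIPScore (Hessel et al., 2021) is a reference-free metric that rescales the image-text cosine similarity. A minimal sketch of the definition, assuming unit-length embeddings from any CLIP model; the scaling constant `w = 2.5` and the clamping at zero follow the original paper:

```python
import torch

def clip_score(image_emb: torch.Tensor, text_emb: torch.Tensor, w: float = 2.5) -> torch.Tensor:
    """CLIPScore(c, v) = w * max(cos(c, v), 0), per Hessel et al., 2021."""
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    cosine = (image_emb * text_emb).sum(dim=-1)
    return w * cosine.clamp(min=0)
```

RefCLIPScore additionally takes the harmonic mean of this value and the maximum cosine similarity between the candidate caption and the reference captions.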