File size: 3,252 Bytes
80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 80e48d5 8793654 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 |
---
library_name: transformers
tags:
- image-captioning
- visual-question-answering
license: apache-2.0
datasets:
- X2FD/LVIS-Instruct4V
- BAAI/SVIT
- HuggingFaceH4/ultrachat_200k
language:
- en
pipeline_tag: image-to-text
widget:
- src: interior.jpg
example_title: Detailed caption
output:
text: "The image showcases a serene and well-lit bedroom. Dominating the scene is a bed, neatly made with a white blanket and a black headboard. Adjacent to the bed, a dresser stands tall, hosting a mirror, a vase, and a flower arrangement. A chair is positioned near the dresser, offering a comfortable spot to sit and relax. The room is adorned with a large window that offers a picturesque view of trees outside. The walls are painted in a soothing shade of white, enhancing the overall ambiance of the space."
- src: cat.jpg
example_title: Short caption
output:
text: "A white and orange cat stands on its hind legs, reaching towards a wooden table with a white teapot and a basket of red berries. The table is set on a wooden bench, surrounded by orange flowers. The cat's position and actions suggest curiosity and playfulness."
---
<h1 align="center">UForm</h1>
<h3 align="center">
Pocket-Sized Multimodal AI<br/>
For Content Understanding and Generation<br/>
</h3>
## Description
UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:
1. CLIP-like ViT-H/14
2. [Qwen1.5-0.5B-Chat](https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat)
The model was pre-trained on the internal image captioning dataset and fine-tuned on public instructions datasets: SVIT, LVIS, VQAs datasets.
### Usage
The generative model can be used to caption images, answer questions about them. Also it is suitable for a multimodal chat.
```python
from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-halfB", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-qwen-halfB", trust_remote_code=True)
prompt = "Question or Instruction"
image = Image.open("image.jpg")
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
output = model.generate(
**inputs,
do_sample=False,
use_cache=True,
max_new_tokens=256,
eos_token_id=151645,
pad_token_id=processor.tokenizer.pad_token_id
)
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```
You can check examples of different prompts in our demo space.
## Evaluation
For captioning evaluation we measure CLIPScore and RefCLIPScore¹.
| Model | LLM Size | SQA | MME | MMBench | Average¹ |
| :---------------------------------- | -------: | -----:| ------:| --------:| --------:|
| UForm-Gen2-Qwen-halfB | 0.5B | 45.5 | 880.1 | 42.0 | 29.31 |
| MobileVLM v2 | 1.4B | 52.1 | 1302.8 | 57.7 | 36.81 |
| LLaVA-Phi | 2.7B | 68.4 | 1335.1 | 59.8 | 42.95 |
¹MME scores were divided by 2000 before averaging.
|