File size: 4,322 Bytes
983f690
 
 
 
48f1c5c
d5681a9
983f690
 
a732c89
983f690
d5681a9
 
 
 
aa7c7be
983f690
 
dba7af6
983f690
dba7af6
983f690
 
 
 
 
f986d1d
983f690
bfdfb84
983f690
eafa22b
983f690
eafa22b
983f690
 
f986d1d
983f690
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51a6725
b31d1e1
51a6725
b31d1e1
51a6725
983f690
 
 
 
 
 
 
 
51a6725
983f690
 
 
 
 
 
 
 
 
51a6725
983f690
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e6f3920
 
983f690
 
 
e6f3920
 
 
 
983f690
 
 
 
 
e6f3920
983f690
e6f3920
 
d5681a9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
---
language:
- en
tags:
- Vision
- HelpingAI
license: mit
library_name: transformers
base_model: visheratin/MC-LLaVA-3b
widget:
- text: What animal is it?
  src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
- text: Where is it?
  src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
pipeline_tag: image-text-to-text
---

# HelpingAI-Vision

<a target="_blank" href="https://colab.research.google.com/drive/1t2OAMVSKsiqVgvuHq7rhyNv28b67u0D8">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Model details

The fundamental concept behind HelpingAI-Vision is to generate one token embedding per N parts of an image, as opposed to producing N visual token embeddings for the entire image. This approach, based on the HelpingAI-Lite and incorporating the LLaVA adapter, aims to enhance scene understanding by capturing more detailed information.

For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). Subsequently, all N embeddings undergo processing through the LLaVA adapter, resulting in a token embedding of size [N, 2560]. Currently, these tokens lack explicit information about their position in the original image, with plans to incorporate positional information in a later update.

HelpingAI-Vision was fine-tuned from MC-LLaVA-3b.

The model adopts the ChatML prompt format, suggesting its potential application in chat-based scenarios. If you have specific queries or would like further details, feel free ask
```
<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

## How to use

**Install dependencies**

```bash
!pip install -q open_clip_torch timm einops
```

**Download modeling files**

```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="processing_llava.py", local_dir="./", force_download=True)
```

**Create a model**

```python
from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("OEvortex/HelpingAI-Vision", torch_dtype=torch.float16)
model = model.to("cuda")
```

**Create processors**

```python
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
```

**Set image and text**

```python
from PIL import Image
import requests

image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```

**Process inputs**

```python
with torch.inference_mode():
  inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)

from transformers import TextStreamer

streamer = TextStreamer(tokenizer)
```

**Generate the data**

```python
%%time
with torch.inference_mode():
  output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9, temperature=1.2, eos_token_id=tokenizer.eos_token_id, streamer=streamer)
print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
```