tags:
- GGUF
- Image-Text-to-Text
---

# Omnivision

## Introduction

Omni-Vision is a sub-billion-parameter (968M) multimodal model that processes both visual and text inputs. Built on LLaVA's architecture, it introduces a novel token compression technique that reduces the number of image tokens from 729 to 81, improving efficiency without compromising visual understanding on edge devices. It has two key enhancements:

- **9x Token Reduction through Token Compression**: Significantly decreases the image token count (from 729 to 81), reducing latency and computational cost, which is ideal for on-device applications; a rough sketch of the idea follows this list.
- **Minimal-Edit DPO for Enhanced Response Quality**: Improves model responses through targeted edits, maintaining core capabilities without significant behavior shifts.
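
To make the 9x reduction concrete, here is a minimal, illustrative sketch of one way to compress a 27x27 grid of SigLIP patch embeddings (729 tokens) into 81 tokens by merging each 3x3 neighborhood into a single token. The function name, the grouping scheme, and the 1152-dim embedding size are assumptions for illustration; the exact compression used in Omni-Vision may differ (see our blogs for details).

```python
import torch

def compress_image_tokens(patch_embeds: torch.Tensor, group: int = 3) -> torch.Tensor:
    """Illustrative 9x token compression (not the released implementation).

    Merges each non-overlapping `group` x `group` block of the patch grid into a
    single token by concatenating the block's embeddings along the channel dim:
    [batch, 729, dim] -> [batch, 81, dim * 9] for a 27x27 grid and group=3.
    """
    b, n, d = patch_embeds.shape
    side = int(n ** 0.5)                          # 27 for 729 patch tokens
    x = patch_embeds.view(b, side, side, d)
    x = x.view(b, side // group, group, side // group, group, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each 3x3 block together
    return x.view(b, (side // group) ** 2, group * group * d)

# 729 patch embeddings (assumed 1152-dim, as in SigLIP-400M) -> 81 compressed tokens
tokens = compress_image_tokens(torch.randn(1, 729, 1152))
print(tokens.shape)  # torch.Size([1, 81, 10368])
```

The compressed tokens are what the projection layer (see [Model Architecture](#model-architecture)) maps into the language model's token space.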

Quick Links:

1. Interact directly in the HuggingFace Space.
2. [How to run locally in 2 simple steps](#how-to-use---quickstart)
3. [Learn more details in our blogs](#learn-more-in-our-blogs)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases

Omnivision is best used locally on edge devices. It is intended for visual question answering and image captioning:

1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it, e.g. "What kind of cat is this?"
2. Image Captioning: Bridging the gap between vision and language, the model extracts details, understands the scene, and then crafts a sentence or two that tells the story, e.g. "Describe this image."

Example:

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/w07yBAp_lZt12E_Vz0Lyk.png" alt="Example image" style="width:250px;"/>

```bash
>>>> caption this
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/dHZSgVGY9yV_lsNIW-iRj.png)

## Benchmarks

Below is a figure showing how Omnivision performs against nanoLLAVA. Across all tasks, Omnivision outperforms nanoLLAVA, previously the world's smallest vision-language model.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate Omnivision's performance.

| Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ScienceQA (Test) | 64.5 | 59.0 | NA |
| POPE | 89.4 | 84.1 | NA |

## How to Use - Quickstart

Below we demonstrate how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.

**Step 2: Run the following command in your terminal**

```bash
nexa run omnivision
```

## Model Architecture

Omni-Vision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
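
As a rough sketch of how these three components fit together (assuming a PyTorch-style interface; the class name, placeholder modules, and the 1152/896 hidden sizes are illustrative assumptions, not Nexa AI's released code):

```python
import torch
import torch.nn as nn

class OmniVisionSketch(nn.Module):
    """Illustrative wiring of the three components described above."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        self.vision_encoder = vision_encoder   # stands in for SigLIP-400M (384 res, 14x14 patches)
        self.projector = nn.Sequential(        # MLP projecting into the LLM's token space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model   # stands in for Qwen2.5-0.5B-Instruct

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        patch_embeds = self.vision_encoder(images)    # [B, 729, vision_dim]
        # The 9x token compression sketched in the Introduction would run here,
        # leaving 81 image tokens instead of 729.
        image_tokens = self.projector(patch_embeds)   # [B, num_image_tokens, llm_dim]
        # Place the projected image tokens alongside the text embeddings and decode.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

In a LLaVA-style setup the image tokens typically replace an image placeholder inside the prompt rather than being simply prepended; the concatenation above is only meant to show the data flow.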

## Training
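
The full training recipe is covered in the blogs linked below. Purely as a generic, hypothetical illustration of the minimal-edit DPO idea from the Introduction (not Omni-Vision's actual pipeline), a preference pair can be built from the model's own response and a minimally edited version of it, then scored with the standard DPO objective:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on per-sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Hypothetical minimal-edit preference pair for one image:
# "rejected" is the model's own draft caption, "chosen" is the same draft with
# one targeted correction, so the preferred behaviour stays close to the model's.
rejected = "A black cat sits on a red sofa next to a dog."
chosen = "A black cat sits on a red sofa."  # minimal edit: drop the hallucinated dog

# In practice the four log-probabilities come from scoring `chosen` and `rejected`
# (conditioned on the image and prompt) under the policy and a frozen reference model.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.8]),
                torch.tensor([-12.5]), torch.tensor([-11.9]))
print(loss.item())
```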

### Learn more in our blogs

[Blogs](https://nexa.ai)

### Join Discord Community

[Discord](https://discord.gg/nexa-ai)