Update README.md
README.md
CHANGED
@@ -17,14 +17,14 @@ Omni-Vision is a compact multimodal model that processes both visual and text in
 - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
 The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
 
-**Feedback:**
+**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
 
 ## Intended Use Cases
 
-1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it.
-2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story.
+1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it. E.g. "What kind of cat is this?"
+2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story. E.g. "Describe this image."
 
-##
+## Benchmarks
 
 | Benchmark          | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
 |--------------------|---------------------|-----------|-------------|
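For orientation on the architecture described in this hunk: a minimal sketch of how an MLP projection layer can map vision-encoder embeddings into the language model's token space. This is illustrative only, not the released implementation; the layer count, module names, and embedding sizes below are assumptions.

```python
# Minimal sketch (not the released implementation) of an MLP projector that
# maps vision-encoder embeddings into the language model's embedding space.
# Dimensions are illustrative placeholders, not the model's actual sizes.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        # Two-layer MLP: vision embedding -> hidden -> LLM token space
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (batch, num_image_tokens, vision_dim)
        # returns:      (batch, num_image_tokens, llm_dim), ready to be
        # concatenated with the text token embeddings fed to the LLM
        return self.mlp(image_embeds)

projector = VisionProjector()
dummy_image_embeds = torch.randn(1, 729, 1152)    # e.g. a 27x27 patch grid (assumed)
print(projector(dummy_image_embeds).shape)        # torch.Size([1, 729, 896])
```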
@@ -36,22 +36,27 @@ The vision encoder first transforms input images into embeddings, which are then
 | ScienceQA (Test)  | 64.5                | 59.0      | NA          |
 | POPE              | 89.4                | 84.1      | NA          |
 
-
+<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztP1o5TgBAsFVzpGMy9H.png" alt="Benchmark Radar Chart" width="500"/>
 
 ## How to use
 
-This repository contains two versions of Llama-3.2-11B-Vision-Instruct, for use with transformers and with the original `llama` codebase.
-
 **Test in HuggingFace Space**
 
 **Run Locally**
 
 Install Nexa-SDK
 
+Install Nexa-SDK:
+
+> **Note**: To run our models locally, you'll need to install Nexa-SDK. It’s an on-device inference framework that enables efficient and flexible deployment of our models directly on your hardware. With support for text, image, audio, and multimodal processing, Nexa-SDK brings powerful AI capabilities to your local environment.
+
 ```bash
 nexa run omnivision
 ```
 
+## Technical Innovations for Edge Deployment
+- 9x Token Reduction through Token Compression
+- Minimal-Edit DPO for Enhanced Response Quality
 
 ## Training
 
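The "9x Token Reduction through Token Compression" bullet added above is not detailed in this README. One common way to obtain exactly 9x fewer image tokens is to merge each 3x3 neighborhood of patch embeddings into a single token before projection; the sketch below illustrates that general reshape-and-concatenate idea only, and the grid size and dimensions are assumptions rather than Omni-Vision's actual configuration.

```python
# Illustrative sketch of 9x image-token compression via 3x3 spatial grouping.
# This shows the general reshape-and-merge idea only; the actual mechanism
# used by Omni-Vision is not specified in this README.
import torch

def compress_tokens_9x(image_embeds: torch.Tensor, grid: int = 27) -> torch.Tensor:
    """(batch, grid*grid, dim) -> (batch, (grid//3)**2, 9*dim)."""
    b, n, d = image_embeds.shape
    assert n == grid * grid and grid % 3 == 0
    x = image_embeds.view(b, grid, grid, d)
    # Split each spatial axis into (grid//3, 3), then fold every 3x3 patch
    # neighborhood into one token by concatenating along the channel dim.
    x = x.view(b, grid // 3, 3, grid // 3, 3, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid // 3) ** 2, 9 * d)
    return x

tokens = torch.randn(1, 729, 1152)       # 27x27 patch embeddings (illustrative)
print(compress_tokens_9x(tokens).shape)  # torch.Size([1, 81, 10368]) -> 9x fewer tokens
```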
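The "Minimal-Edit DPO" bullet is likewise not specified in this README. For reference, the sketch below implements the standard DPO objective; the "minimal-edit" qualifier presumably refers to how the preference pairs are constructed (the preferred response being a lightly edited version of the rejected one), which is an assumption here rather than a documented detail.

```python
# Standard DPO loss, shown for reference only; this is not Nexa's training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are summed log-probs of each response under the policy / reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
print(loss)
```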