alanzhuly committed
Commit 336cb9d
1 Parent(s): ab88027

Update README.md

Files changed (1)
  1. README.md +45 -23
README.md CHANGED
@@ -6,26 +6,45 @@ tags:
  - GGUF
  - Image-Text-to-Text
  ---
- ## Model Information

- Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built upon LLaVA's architecture principles, it introduces a novel token compression method that significantly reduces the number of image tokens (from 729 to 81), achieving best-in-class efficiency while maintaining exceptional visual understanding capabilities on edge devices.

- **Model Architecture:** Omni-Vision's architecture consists of three key components:
- - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- - Vision Encoder: SigLIP-400M operates at 384 resolution with a 14×14 patch size to generate image embeddings
- - Projection Layer: a Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
- The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

  **Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

  ## Intended Use Cases

- 1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it. E.g. "What kind of cat is this?"
- 2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story. E.g. "Describe this image."

  ## Benchmarks

  | Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
  |-------------------|----------------------|-----------|-------------|
  | MM-VET | 27.5 | 23.9 | 49.5 |
@@ -36,27 +55,30 @@ The vision encoder first transforms input images into embeddings, which are then
  | ScienceQA (Test) | 64.5 | 59.0 | NA |
  | POPE | 89.4 | 84.1 | NA |

- <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztP1o5TgBAsFVzpGMy9H.png" alt="Benchmark Radar Chart" width="500"/>
-
- ## How to use

- **Test in HuggingFace Space**

- **Run Locally**

- Install Nexa-SDK:

- > **Note**: To run our models locally, you'll need to install Nexa-SDK. It's an on-device inference framework that enables efficient and flexible deployment of our models directly on your hardware. With support for text, image, audio, and multimodal processing, Nexa-SDK brings powerful AI capabilities to your local environment.

  ```bash
  nexa run omnivision
  ```

- ## Technical Innovations for Edge Deployment
- - 9x Token Reduction through Token Compression
- - Minimal-Edit DPO for Enhanced Response Quality

  ## Training

@@ -73,6 +95,6 @@ The final stage implements DPO by first generating responses to images using the
  ### Learn more in our blogs
- ### Join Discord Community:
- ### Website: nexa.ai
-

  - GGUF
  - Image-Text-to-Text
  ---
+ # Omnivision
+
+ ## Introduction
+
+ Omni-Vision is a sub-billion-parameter (968M) multimodal model capable of processing both visual and text inputs. Built upon LLaVA's architecture, it introduces a novel token compression technique that reduces the number of image tokens from 729 to 81, optimizing efficiency without compromising visual understanding on edge devices. It has two key enhancements:
+
+ - **9x Token Reduction through Token Compression**: A significant decrease in image token count reduces latency and computational cost, making the model ideal for on-device applications (see the sketch below).
+ - **Minimal-Edit DPO for Enhanced Response Quality**: Improves model responses through targeted edits, maintaining core capabilities without significant behavior shifts.
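
To make the 9x figure concrete, here is a minimal, illustrative sketch of this style of token compression: a 27×27 grid of patch tokens (729 total, matching SigLIP-400M at 384×384 with 14×14 patches) is folded into a 9×9 grid of 81 tokens by moving each 3×3 neighborhood into the embedding dimension. The function name and shapes are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of 9x image-token compression (assumed shapes, not the released code).
import torch

def compress_tokens(image_tokens: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Fold each window x window neighborhood of tokens into the embedding dimension."""
    batch, num_tokens, dim = image_tokens.shape
    side = int(num_tokens ** 0.5)                     # 27 patches per side for 729 tokens
    x = image_tokens.view(batch, side, side, dim)
    x = x.view(batch, side // window, window, side // window, window, dim)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()      # group each 3x3 neighborhood together
    return x.view(batch, (side // window) ** 2, window * window * dim)

tokens = torch.randn(1, 729, 1152)                    # assumed SigLIP-400M hidden size of 1152
print(compress_tokens(tokens).shape)                  # torch.Size([1, 81, 10368])
```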
+
+ Quick Links:
+ 1. Interact with the model directly in the HuggingFace Space.
+ 2. [How to run locally in 2 simple steps](#how-to-use---quickstart)
+ 3. Learn more in our blogs

  **Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

  ## Intended Use Cases
+ Omnivision is best used locally on edge devices. It is intended for two main tasks:
+
+ 1. Visual Question Answering (VQA) and Visual Reasoning: ask questions about an image and the model reasons over what it sees to answer them.
+ 2. Image Captioning: the model extracts details, understands the scene, and crafts a sentence or two that tells the story.
+
+ Example:
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/w07yBAp_lZt12E_Vz0Lyk.png" alt="Example input image" style="width:250px;"/>
+ ```bash
+ >>>> caption this
+ ```
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/dHZSgVGY9yV_lsNIW-iRj.png)

  ## Benchmarks

+ The figure below shows how Omnivision performs against nanoLLAVA, the previous world's smallest vision-language model. Omnivision outperforms it on all tasks.
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>
+
+ We conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate Omnivision's performance.
+
  | Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
  |-------------------|----------------------|-----------|-------------|
  | MM-VET | 27.5 | 23.9 | 49.5 |

  | ScienceQA (Test) | 64.5 | 59.0 | NA |
  | POPE | 89.4 | 84.1 | NA |

+ ## How to Use - Quickstart
+ The following shows how to run Omnivision locally on your device.
+
+ **Step 1: Install Nexa-SDK (local on-device inference framework)**
+
+ [Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)
+
+ > Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via a Python package or an executable installer.
+
+ **Step 2: Run the following command in your terminal**
+
  ```bash
  nexa run omnivision
  ```
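
The `nexa run omnivision` command starts an interactive session in your terminal, as shown in the captioning example above. If you prefer to launch it from a script, a minimal sketch (assuming Nexa-SDK is installed and the `nexa` executable is on your PATH):

```python
# Minimal sketch: launch the documented CLI command from Python.
# Assumes Nexa-SDK is installed and the `nexa` executable is on your PATH.
import subprocess

subprocess.run(["nexa", "run", "omnivision"], check=True)
```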

+ ## Model Architecture
+ Omni-Vision's architecture consists of three key components:
+
+ - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
+ - Vision Encoder: SigLIP-400M operates at 384 resolution with a 14×14 patch size to generate image embeddings
+ - Projection Layer: a Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
+
+ The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
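
As a rough sketch of how these components compose, the snippet below wires a compressed vision embedding through a projection MLP into the language model's hidden size. The layer structure and hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B-Instruct) are assumptions for illustration, not the released implementation.

```python
# Illustrative projection-layer wiring (assumed layer structure and hidden sizes).
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Maps compressed vision embeddings into the language model's token space."""
    def __init__(self, vision_dim: int = 1152 * 9, llm_dim: int = 896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(vision_embeddings)

# image -> SigLIP-400M -> token compression -> projection -> prepended to text tokens
compressed = torch.randn(1, 81, 1152 * 9)     # stand-in for compressed SigLIP features
llm_image_tokens = ProjectionMLP()(compressed)
print(llm_image_tokens.shape)                 # torch.Size([1, 81, 896])
```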
 
  ## Training

  ### Learn more in our blogs
+ [Blogs](https://nexa.ai)
+ ### Join Discord Community
+ [Discord](https://discord.gg/nexa-ai)