tags:
- GGUF
- Image-Text-to-Text
---

# Omnivision

## Introduction

Omni-Vision is a sub-billion-parameter (968M) multimodal model that processes both visual and text inputs. Built on LLaVA's architecture, it introduces a novel token compression technique that reduces the number of image tokens from 729 to 81, improving efficiency without compromising visual understanding on edge devices. It has two key enhancements:

- **9x Token Reduction through Token Compression**: Significantly decreases the image token count (from 729 to 81), reducing latency and computational cost, which is ideal for on-device applications; a rough sketch of the idea follows this list.
- **Minimal-Edit DPO for Enhanced Response Quality**: Improves model responses through targeted edits, maintaining core capabilities without significant behavior shifts.
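
To make the 9x reduction concrete, here is a minimal, illustrative sketch of one way to compress a 27x27 grid of SigLIP patch embeddings (729 tokens) into 81 tokens by merging each 3x3 neighborhood into a single token. The function name, the grouping scheme, and the 1152-dim embedding size are assumptions for illustration; the exact compression used in Omni-Vision may differ (see our blogs for details).

```python
import torch

def compress_image_tokens(patch_embeds: torch.Tensor, group: int = 3) -> torch.Tensor:
    """Illustrative 9x token compression (not the released implementation).

    Merges each non-overlapping `group` x `group` block of the patch grid into a
    single token by concatenating the block's embeddings along the channel dim:
    [batch, 729, dim] -> [batch, 81, dim * 9] for a 27x27 grid and group=3.
    """
    b, n, d = patch_embeds.shape
    side = int(n ** 0.5)                          # 27 for 729 patch tokens
    x = patch_embeds.view(b, side, side, d)
    x = x.view(b, side // group, group, side // group, group, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each 3x3 block together
    return x.view(b, (side // group) ** 2, group * group * d)

# 729 patch embeddings (assumed 1152-dim, as in SigLIP-400M) -> 81 compressed tokens
tokens = compress_image_tokens(torch.randn(1, 729, 1152))
print(tokens.shape)  # torch.Size([1, 81, 10368])
```

The compressed tokens are what the projection layer (see [Model Architecture](#model-architecture)) maps into the language model's token space.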

Quick Links:

1. Interact directly in the HuggingFace Space.
2. [How to run locally in 2 simple steps](#how-to-use---quickstart)
3. [Learn more details in our blogs](#learn-more-in-our-blogs)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases

Omnivision is best used locally on edge devices. It is intended for visual question answering and image captioning:

1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it, e.g. "What kind of cat is this?"
2. Image Captioning: Bridging the gap between vision and language, the model extracts details, understands the scene, and then crafts a sentence or two that tells the story, e.g. "Describe this image."

Example:

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/w07yBAp_lZt12E_Vz0Lyk.png" alt="Example image" style="width:250px;"/>

```bash
>>>> caption this
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/dHZSgVGY9yV_lsNIW-iRj.png)

## Benchmarks

Below is a figure showing how Omnivision performs against nanoLLAVA. Across all tasks, Omnivision outperforms nanoLLAVA, previously the world's smallest vision-language model.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate Omnivision's performance.

| Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ScienceQA (Test) | 64.5 | 59.0 | NA |
| POPE | 89.4 | 84.1 | NA |

## How to Use - Quickstart

Below we demonstrate how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.

**Step 2: Run the following command in your terminal**

```bash
nexa run omnivision
```

## Model Architecture

Omni-Vision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
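
As a rough sketch of how these three components fit together (assuming a PyTorch-style interface; the class name, placeholder modules, and the 1152/896 hidden sizes are illustrative assumptions, not Nexa AI's released code):

```python
import torch
import torch.nn as nn

class OmniVisionSketch(nn.Module):
    """Illustrative wiring of the three components described above."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        self.vision_encoder = vision_encoder   # stands in for SigLIP-400M (384 res, 14x14 patches)
        self.projector = nn.Sequential(        # MLP projecting into the LLM's token space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model   # stands in for Qwen2.5-0.5B-Instruct

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        patch_embeds = self.vision_encoder(images)    # [B, 729, vision_dim]
        # The 9x token compression sketched in the Introduction would run here,
        # leaving 81 image tokens instead of 729.
        image_tokens = self.projector(patch_embeds)   # [B, num_image_tokens, llm_dim]
        # Place the projected image tokens alongside the text embeddings and decode.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

In a LLaVA-style setup the image tokens typically replace an image placeholder inside the prompt rather than being simply prepended; the concatenation above is only meant to show the data flow.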

## Training
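
The full training recipe is covered in the blogs linked below. Purely as a generic, hypothetical illustration of the minimal-edit DPO idea from the Introduction (not Omni-Vision's actual pipeline), a preference pair can be built from the model's own response and a minimally edited version of it, then scored with the standard DPO objective:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on per-sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Hypothetical minimal-edit preference pair for one image:
# "rejected" is the model's own draft caption, "chosen" is the same draft with
# one targeted correction, so the preferred behaviour stays close to the model's.
rejected = "A black cat sits on a red sofa next to a dog."
chosen = "A black cat sits on a red sofa."  # minimal edit: drop the hallucinated dog

# In practice the four log-probabilities come from scoring `chosen` and `rejected`
# (conditioned on the image and prompt) under the policy and a frozen reference model.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.8]),
                torch.tensor([-12.5]), torch.tensor([-11.9]))
print(loss.item())
```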

### Learn more in our blogs

[Blogs](https://nexa.ai)

### Join Discord Community

[Discord](https://discord.gg/nexa-ai)