---
license: apache-2.0
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
# Omnivision
## Introduction
Omnivision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs and is optimized for edge devices. Building on LLaVA's architecture, it features:
- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- **Trustworthy Results**: Reduces hallucinations through **DPO** training on trustworthy data.
**Quick Links:**
1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [Blogs](https://nexa.ai/blogs/omni-vision)
**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
## Intended Use Cases
Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.
**Example Demo:**
Generating a caption for a 1046×1568 image on an M4 Pro MacBook takes **< 2s of processing time** and requires only 988 MB of RAM and 948 MB of storage.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/P8HFmA7huCdpMClWVuXZO.png" alt="Example" style="width:700px;"/>
## Benchmarks
The figure below compares Omnivision with nanoLLAVA, previously the world's smallest vision-language model. Omnivision outperforms it on every task shown.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>
We evaluated Omnivision on a series of benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE.
| Benchmark | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ChartQA (Test) | 59.2 | NA | 73.5 |
| MMMU (Test) | 41.8 | 28.6 | 41.1 |
| MMMU (Eval) | 39.9 | 30.4 | 41.1 |
| ScienceQA (Eval) | 62.2 | 59.0 | NA |
| ScienceQA (Test) | 64.5 | 59.0 | NA |
| POPE | 89.4 | 84.1 | NA |
## How to Use On Device
The steps below show how to run Omnivision locally on your device.
**Step 1: Install Nexa-SDK (local on-device inference framework)**
[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)
> Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed as a Python package or with an executable installer.
**Step 2: Run the following command in your terminal**
```bash
nexa run omnivision
```
## Model Architecture
Omnivision's architecture consists of three key components:
- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.
The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
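The 729 and 81 token counts follow from the encoder's geometry: a 384-pixel side split into 14-pixel patches gives 27 whole patches per side, or 27 × 27 = 729 patches, and a 9x reduction leaves 81. The sketch below works through that arithmetic and shows one way such a reduction could be done, by concatenating each group of 9 consecutive token vectors into one wider vector before the MLP. The grouping scheme, the function names, and the toy hidden size are assumptions for illustration; this README does not specify how the projector reduces tokens.

```python
def patches_per_side(image_size: int, patch_size: int) -> int:
    """Number of whole patches that fit along one side of a square image."""
    return image_size // patch_size


def group_tokens(tokens: list[list[float]], group: int) -> list[list[float]]:
    """Concatenate every `group` consecutive token vectors into one wider vector."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    grouped = []
    for start in range(0, len(tokens), group):
        merged: list[float] = []
        for vec in tokens[start:start + group]:
            merged.extend(vec)
        grouped.append(merged)
    return grouped


side = patches_per_side(384, 14)   # 384 // 14 = 27
num_image_tokens = side * side     # 27 * 27 = 729

hidden = 4  # toy hidden size (assumption), chosen only to keep the example small
tokens = [[float(i)] * hidden for i in range(num_image_tokens)]

# Hypothetical reduction step: 729 vectors of width 4 -> 81 vectors of width 36.
reduced = group_tokens(tokens, group=9)

print(num_image_tokens, len(reduced), len(reduced[0]))  # prints: 729 81 36
```

Under this scheme no embedding values are discarded: the sequence gets 9x shorter while each token gets 9x wider, so the language model attends over far fewer image positions.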
## Training
We developed Omnivision through a three-stage training pipeline:
**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage involves training on structured chat histories that incorporate images for the model to generate more contextually appropriate responses.
**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. The corrected and original outputs then form the chosen-rejected pairs. This fine-tuning targets essential improvements to the model's output without altering its core response characteristics.
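For reference, the standard DPO objective for a single chosen-rejected pair can be sketched as below. It assumes the teacher-corrected response is the chosen one and the original response is the rejected one. The `PreferencePair` class, the `dpo_loss` helper, the example strings, the log-probability values, and `beta=0.1` are all illustrative and not taken from the Omnivision training setup.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical structure)."""
    prompt: str    # question or instruction about the image
    chosen: str    # assumed: the teacher's minimally edited correction
    rejected: str  # assumed: the base model's original response


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = PreferencePair(
    prompt="How many people are in the photo?",
    chosen="There are three people in the photo.",
    rejected="There are four people in the photo.",
)

# Made-up log-probabilities: the policy favors the chosen response more
# strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-4.0, -7.0, -5.0, -6.0)
print(f"{loss:.3f}")
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2. It decreases as the policy shifts probability toward the chosen response relative to the reference.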
## What's next for Omnivision?
Omnivision is in early development and we are working to address current limitations:
- Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
- Improve document and text understanding.
In the long term, we aim to develop Omnivision as a fully optimized, production-ready solution for edge AI multimodal applications.
### Follow us
[Blogs](https://nexa.ai/blogs/omni-vision) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/alanzhuly)