|
--- |
|
license: apache-2.0 |
|
tags: |
|
- multimodal |
|
- conversational |
|
- GGUF |
|
- Image-Text-to-Text |
|
--- |
|
# Omnivision |
|
|
|
## Introduction |
|
|
|
Omnivision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs, optimized for edge devices. Building on LLaVA's architecture, it features:
|
|
|
- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost. |
|
- **Trustworthy Results**: Reduces hallucinations via **DPO** training on trustworthy data.
|
|
|
**Quick Links:** |
|
1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo). |
|
2. [Quickstart for local setup](#how-to-use-on-device) |
|
3. Learn more in our [Blog](https://nexa.ai/blogs/omni-vision)
|
|
|
**Feedback:** Send questions or comments about the model on our [Discord](https://discord.gg/nexa-ai).
|
|
|
## Intended Use Cases |
|
Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications. |
|
|
|
**Example Demo:** |
|
Generating a caption for a 1046×1568 image on an M4 Pro MacBook takes **under 2 s of processing time** and requires only 988 MB of RAM and 948 MB of storage.
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/P8HFmA7huCdpMClWVuXZO.png" alt="Example" style="width:700px;"/> |
|
|
|
|
|
## Benchmarks |
|
|
|
The figure below compares Omnivision against nanoLLAVA, previously the world's smallest vision-language model. Omnivision outperforms it on every task.
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/> |
|
|
|
We conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate Omnivision's performance.
|
|
|
| Benchmark | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B | |
|
|-------------------|----------------------|-----------|-------------| |
|
| MM-VET | 27.5 | 23.9 | 49.5 | |
|
| ChartQA (Test) | 59.2 | NA | 73.5 | |
|
| MMMU (Test) | 41.8 | 28.6 | 41.1 | |
|
| MMMU (Eval) | 39.9 | 30.4 | 41.1 | |
|
| ScienceQA (Eval) | 62.2 | 59.0 | NA | |
|
| ScienceQA (Test) | 64.5 | 59.0 | NA | |
|
| POPE | 89.4 | 84.1 | NA | |
|
|
|
|
|
## How to Use On Device |
|
The following steps show how to run Omnivision locally on your device.
|
|
|
**Step 1: Install Nexa-SDK (local on-device inference framework)** |
|
|
|
[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer) |
|
|
|
> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via a Python package or an executable installer.
|
|
|
**Step 2: Run the following command in your terminal**
|
|
|
```bash |
|
nexa run omnivision |
|
``` |
|
|
|
## Model Architecture
|
Omnivision's architecture consists of three key components: |
|
|
|
- Base Language Model: Qwen2.5-0.5B-Instruct processes text inputs

- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to generate image embeddings

- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.
|
|
|
The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding. |
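To make the 9x reduction concrete: SigLIP at 384×384 with 14×14 patches yields a 27×27 grid of 729 patch embeddings, and 729 / 9 = 81 tokens. The exact projector design is not spelled out here, so the sketch below is an assumption: it illustrates one plausible reshape-based scheme that concatenates each 3×3 patch neighborhood into a single token before the MLP (omitted) maps it into the language model's token space. The toy dimension `D` is illustrative.

```python
# Hypothetical sketch of the 9x token compression; the real projector may
# differ. 27x27 patch grid -> 729 tokens in, 9x9 grid -> 81 tokens out.
GRID, GROUP = 27, 3   # group each 3x3 neighborhood into one token
D = 4                 # toy embedding dim (SigLIP-400M is much larger)

# One embedding vector per patch, identified here by its grid position.
patches = [[[float(r * GRID + c)] * D for c in range(GRID)] for r in range(GRID)]

# 729 -> 81 tokens: concatenate the 9 embeddings of each 3x3 block; an MLP
# (omitted) would then map each 9*D vector into the LM's token space.
compressed = []
for R in range(GRID // GROUP):
    for C in range(GRID // GROUP):
        block = [x for dr in range(GROUP) for dc in range(GROUP)
                 for x in patches[R * GROUP + dr][C * GROUP + dc]]
        compressed.append(block)

print(len(compressed), len(compressed[0]))  # 81 36
```

Each compressed token carries the full information of its 3×3 neighborhood, which is why latency drops without discarding image content outright.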
|
|
|
## Training |
|
|
|
We developed Omnivision through a three-stage training pipeline: |
|
|
|
**Pretraining:** |
|
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships. |
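The freeze schedule described above can be sketched as a simple predicate over parameter names. This is a minimal illustration, not the actual training code, and the module names are assumptions:

```python
# Stage-1 sketch: only the projection layer is unfrozen during pretraining.
# Parameter names below are illustrative, not Omnivision's real module names.
def pretrain_trainable(param_name: str) -> bool:
    """Return True if this parameter should receive gradient updates."""
    return param_name.startswith("projector.")

params = [
    "vision_encoder.blocks.0.attn.qkv.weight",
    "projector.mlp.0.weight",
    "projector.mlp.2.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
]
trainable = [p for p in params if pretrain_trainable(p)]
print(trainable)  # only the two projector weights
```

In a framework like PyTorch this predicate would set `requires_grad` per parameter, leaving the vision encoder and language model intact while the projector learns the alignment.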
|
|
|
**Supervised Fine-tuning (SFT):** |
|
We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model generates more contextually appropriate responses.
|
|
|
**Direct Preference Optimization (DPO):** |
|
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. Fine-tuning on these pairs targets essential output improvements without altering the model's core response characteristics.
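The chosen-rejected pairs feed the standard DPO objective, which rewards the policy for preferring the corrected (chosen) response over the original (rejected) one relative to a frozen reference model. The sketch below uses the textbook formulation on sequence log-probabilities; the actual hyperparameters (e.g. `beta`) used for Omnivision are not stated here:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Textbook DPO loss on sequence log-probs: -log(sigmoid(beta * margin)).

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference model
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))

# With no preference margin the loss is log(2); it shrinks as the policy
# assigns relatively more probability to the chosen response.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
```

Because the teacher's edits are minimal, the reward margin concentrates on the accuracy-critical tokens, which is what lets DPO reduce hallucinations without shifting the model's overall style.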
|
|
|
## What's next for Omnivision? |
|
Omnivision is in early development, and we are working to address its current limitations:
|
- Expand DPO Training: Iteratively increase the scope of DPO (Direct Preference Optimization) training to continually improve model performance and response quality.
|
- Improve document and text understanding |
|
|
|
In the long term, we aim to develop Omnivision as a fully optimized, production-ready solution for edge AI multimodal applications. |
|
|
|
### Follow us |
|
[Blogs](https://nexa.ai/blogs/omni-vision) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/alanzhuly) |