|
--- |
|
license: apache-2.0 |
|
tags: |
|
- multimodal |
|
- conversational |
|
- GGUF |
|
- Image-Text-to-Text |
|
--- |
|
# Omnivision |
|
|
|
## Introduction |
|
|
|
Omnivision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs, optimized for edge devices. Building on LLaVA's architecture, it features:
|
|
|
- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost. |
|
- **Trustworthy Results**: Reduces hallucinations via **DPO** training on trustworthy data.
|
|
|
**Quick Links:** |
|
1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo). |
|
2. [Quickstart for local setup](#how-to-use-on-device) |
|
3. Learn more in our [Blog](https://nexa.ai/blogs/omni-vision)
|
|
|
**Feedback:** Send questions or comments about the model on our [Discord](https://discord.gg/nexa-ai).
|
|
|
## Intended Use Cases |
|
Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications. |
|
|
|
**Example Demo:** |
|
Generating a caption for a 1046×1568 image on an M4 Pro MacBook takes **under 2 s of processing time** and requires only 988 MB of RAM and 948 MB of storage.
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/P8HFmA7huCdpMClWVuXZO.png" alt="Example" style="width:700px;"/> |
|
|
|
|
|
## Benchmarks |
|
|
|
The figure below compares Omnivision against nanoLLAVA, previously the world's smallest vision-language model. Omnivision outperforms it on every task.
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/> |
|
|
|
We conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate Omnivision's performance.
|
|
|
| Benchmark | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B | |
|
|-------------------|----------------------|-----------|-------------| |
|
| MM-VET | 27.5 | 23.9 | 49.5 | |
|
| ChartQA (Test) | 59.2 | NA | 73.5 | |
|
| MMMU (Test) | 41.8 | 28.6 | 41.1 | |
|
| MMMU (Eval) | 39.9 | 30.4 | 41.1 | |
|
| ScienceQA (Eval) | 62.2 | 59.0 | NA | |
|
| ScienceQA (Test) | 64.5 | 59.0 | NA | |
|
| POPE | 89.4 | 84.1 | NA | |
|
|
|
|
|
## How to Use On Device |
|
The following steps show how to run Omnivision locally on your device.
|
|
|
**Step 1: Install Nexa-SDK (local on-device inference framework)** |
|
|
|
[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer) |
|
|
|
> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via a Python package or an executable installer.
|
|
|
**Step 2: Run the following command in your terminal**
|
|
|
```bash |
|
nexa run omnivision |
|
``` |
|
|
|
## Model Architecture
|
Omnivision's architecture consists of three key components: |
|
|
|
- Base Language Model: Qwen2.5-0.5B-Instruct processes text inputs

- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to generate image embeddings

- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.
|
|
|
The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding. |
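To make the 9x reduction concrete: SigLIP at 384×384 with 14×14 patches yields a 27×27 grid of 729 patch embeddings, and 729 / 9 = 81 tokens. The exact projector design is not spelled out here, so the sketch below is an assumption: it illustrates one plausible reshape-based scheme that concatenates each 3×3 patch neighborhood into a single token before the MLP (omitted) maps it into the language model's token space. The toy dimension `D` is illustrative.

```python
# Hypothetical sketch of the 9x token compression; the real projector may
# differ. 27x27 patch grid -> 729 tokens in, 9x9 grid -> 81 tokens out.
GRID, GROUP = 27, 3   # group each 3x3 neighborhood into one token
D = 4                 # toy embedding dim (SigLIP-400M is much larger)

# One embedding vector per patch, identified here by its grid position.
patches = [[[float(r * GRID + c)] * D for c in range(GRID)] for r in range(GRID)]

# 729 -> 81 tokens: concatenate the 9 embeddings of each 3x3 block; an MLP
# (omitted) would then map each 9*D vector into the LM's token space.
compressed = []
for R in range(GRID // GROUP):
    for C in range(GRID // GROUP):
        block = [x for dr in range(GROUP) for dc in range(GROUP)
                 for x in patches[R * GROUP + dr][C * GROUP + dc]]
        compressed.append(block)

print(len(compressed), len(compressed[0]))  # 81 36
```

Each compressed token carries the full information of its 3×3 neighborhood, which is why latency drops without discarding image content outright.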
|
|
|
## Training |
|
|
|
We developed Omnivision through a three-stage training pipeline: |
|
|
|
**Pretraining:** |
|
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships. |
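The freeze schedule described above can be sketched as a simple predicate over parameter names. This is a minimal illustration, not the actual training code, and the module names are assumptions:

```python
# Stage-1 sketch: only the projection layer is unfrozen during pretraining.
# Parameter names below are illustrative, not Omnivision's real module names.
def pretrain_trainable(param_name: str) -> bool:
    """Return True if this parameter should receive gradient updates."""
    return param_name.startswith("projector.")

params = [
    "vision_encoder.blocks.0.attn.qkv.weight",
    "projector.mlp.0.weight",
    "projector.mlp.2.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
]
trainable = [p for p in params if pretrain_trainable(p)]
print(trainable)  # only the two projector weights
```

In a framework like PyTorch this predicate would set `requires_grad` per parameter, leaving the vision encoder and language model intact while the projector learns the alignment.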
|
|
|
**Supervised Fine-tuning (SFT):** |
|
We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model generates more contextually appropriate responses.
|
|
|
**Direct Preference Optimization (DPO):** |
|
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. Fine-tuning on these pairs targets essential output improvements without altering the model's core response characteristics.
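The chosen-rejected pairs feed the standard DPO objective, which rewards the policy for preferring the corrected (chosen) response over the original (rejected) one relative to a frozen reference model. The sketch below uses the textbook formulation on sequence log-probabilities; the actual hyperparameters (e.g. `beta`) used for Omnivision are not stated here:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Textbook DPO loss on sequence log-probs: -log(sigmoid(beta * margin)).

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference model
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))

# With no preference margin the loss is log(2); it shrinks as the policy
# assigns relatively more probability to the chosen response.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
```

Because the teacher's edits are minimal, the reward margin concentrates on the accuracy-critical tokens, which is what lets DPO reduce hallucinations without shifting the model's overall style.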
|
|
|
## What's next for Omnivision? |
|
Omnivision is in early development, and we are working to address its current limitations:
|
- Expand DPO Training: Iteratively increase the scope of DPO (Direct Preference Optimization) training to continually improve model performance and response quality.
|
- Improve document and text understanding |
|
|
|
In the long term, we aim to develop Omnivision as a fully optimized, production-ready solution for edge AI multimodal applications. |
|
|
|
### Follow us |
|
[Blogs](https://nexa.ai/blogs/omni-vision) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/alanzhuly) |