File size: 6,739 Bytes
c9118f4 9e9abb0 c9118f4 cb6484b c9118f4 ca72ebc 79b60d0 cb6484b d4d211f cb6484b 6d9a1d1 cb6484b 46bff64 568fb5b 336cb9d c9118f4 cb6484b 336cb9d 84dd548 93c4844 9fc3117 0067dce 568fb5b 9fc3117 9e9abb0 c9118f4 ab88027 c9118f4 cb6484b 336cb9d 0067dce 77230a9 336cb9d 0a338cc c9118f4 ab88027 c9118f4 cb6484b 336cb9d b51f197 336cb9d cb6484b 336cb9d cb6484b a55214f c9118f4 9fc3117 cb6484b c9118f4 336cb9d c9118f4 336cb9d c9118f4 336cb9d ab88027 336cb9d ab88027 c9118f4 e92971e c9118f4 336cb9d cb6484b 336cb9d 93c4844 336cb9d c9118f4 cb6484b c9118f4 cb6484b 48fbc9a cb6484b c9118f4 d658b03 cb6484b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 |
---
license: apache-2.0
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
# OmniVLM
## 🔥 Latest Update
- [Dec 16, 2024] Our work **"OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference"** is now live on [Arxiv](https://arxiv.org/abs/2412.11475)! 🚀
- [Nov 27, 2024] **Model Improvements:** OmniVLM v3 model's **GGUF file has been updated** in this Hugging Face Repo! ✨
👉 Test these exciting changes in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo)
- [Nov 22, 2024] **Model Improvements:** OmniVLM v2 model's **GGUF file has been updated** in this Hugging Face Repo! ✨ Key Improvements Include:
- Enhanced Art Descriptions
- Better Complex Image Understanding
- Improved Anime Recognition
- More Accurate Color and Detail Detection
- Expanded World Knowledge
We are continuously improving OmniVLM-968M based on your valuable feedback! **More exciting updates coming soon - Stay tuned!** ⭐
## Introduction
OmniVLM is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:
- **9x Token Reduction**: Reduces image tokens from **729** to **81**, cutting latency and computational cost aggressively. Note that the computation of vision encoder and the projection part keep the same, but the computation of language model backbone is reduced due to 9X shorter image token span.
- **Trustworthy Result**: Reduces hallucinations using **DPO** training from trustworthy data.
**Quick Links:**
1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo). (Updated 2024 Nov 21)
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [Blogs](https://nexa.ai/blogs/omni-vision)
**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
## Intended Use Cases
OmniVLM is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.
**Example Demo:**
Generating captions for a 1046×1568 image on M4 Pro Macbook takes **< 2s processing time** and requires only 988 MB RAM and 948 MB Storage.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/ueevDxicb98fXQ7zGN_E2.png" alt="Example" style="width:700px;"/>
## Benchmarks
Below we demonstrate a figure to show how OmniVLM performs against nanollava. In all the tasks, OmniVLM outperforms the previous world's smallest vision-language model.
<img src="benchmark.png" alt="Benchmark Radar Chart" style="width:500px;"/>
We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, POPE to evaluate the performance of OmniVLM.
| Benchmark | Nexa AI OmniVLM v2 | Nexa AI OmniVLM v1 | nanoLLAVA |
|-------------------|------------------------|------------------------|-----------|
| ScienceQA (Eval) | 71.0 | 62.2 | 59.0 |
| ScienceQA (Test) | 71.0 | 64.5 | 59.0 |
| POPE | 93.3 | 89.4 | 84.1 |
| MM-VET | 30.9 | 27.5 | 23.9 |
| ChartQA (Test) | 61.9 | 59.2 | NA |
| MMMU (Test) | 42.1 | 41.8 | 28.6 |
| MMMU (Eval) | 40.0 | 39.9 | 30.4 |
## How to Use On Device
In the following, we demonstrate how to run OmniVLM locally on your device.
**Step 1: Install Nexa-SDK (local on-device inference framework)**
[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)
> Nexa-SDK is a open-sourced, local on-device inference framework, supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS) capabilities. Installable via Python Package or Executable Installer.
**Step 2: Then run the following code in your terminal**
```bash
nexa run omniVLM
```
## Model Architecture ##
OmniVLM's architecture consists of three key components:
- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to vanilla Llava architecture, we designed a projector that reduce 9X image tokens.
The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
## Training
We developed OmniVLM through a three-stage training pipeline:
**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage involves training on structured chat histories that incorporate images for the model to generate more contextually appropriate responses.
**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections while maintaining high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targeted at essential model output improvements without altering the model's core response characteristics
## What's next for OmniVLM?
OmniVLM is in early development and we are working to address current limitations:
- Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
- Improve document and text understanding
In the long term, we aim to develop OmniVLM as a fully optimized, production-ready solution for edge AI multimodal applications.
### Follow us
[Blogs](https://nexa.ai/blogs/OmniVLM) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai) |