---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---

## Model Information

Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built upon LLaVA's architecture principles, it introduces a novel token compression method that significantly reduces the number of image tokens (from 729 to 81), achieving best-in-class efficiency for edge devices while maintaining exceptional visual understanding capabilities.

**Model Architecture:**

Omni-Vision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai).

## Intended Use Cases

1. Visual Question Answering (VQA) and Visual Reasoning: the model looks at a picture and answers questions about it, e.g. "What kind of cat is this?"
2. Image Captioning: the model bridges vision and language by extracting details, understanding the scene, and crafting a sentence or two that tells the story, e.g. "Describe this image."

## Benchmarks

| Benchmark         | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|---------------------|-----------|-------------|
| MM-VET            | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)    | 59.2                | NA        | 73.5        |
| MMMU (Test)       | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)       | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval)  | 62.2                | 59.0      | NA          |
| ScienceQA (Test)  | 64.5                | 59.0      | NA          |
| POPE              | 89.4                | 84.1      | NA          |

*Benchmark Radar Chart*

## How to use

**Test in HuggingFace Space**

**Run Locally**

Install Nexa-SDK:

> **Note**: To run our models locally, you'll need to install Nexa-SDK. It's an on-device inference framework that enables efficient and flexible deployment of our models directly on your hardware. With support for text, image, audio, and multimodal processing, Nexa-SDK brings powerful AI capabilities to your local environment.

Then run:

```bash
nexa run omnivision
```

## Technical Innovations for Edge Deployment

- 9x Token Reduction through Token Compression (729 → 81 image tokens; see the illustrative sketch in the appendix at the end of this card)
- Minimal-Edit DPO for Enhanced Response Quality

## Training

We developed Omni-Vision through a three-stage training pipeline:

**Pretraining:** The initial stage establishes basic visual-linguistic alignment using image-caption pairs; only the projection layer parameters are unfrozen during this stage to learn these fundamental relationships.

**Supervised Fine-tuning (SFT):** We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model learns to generate more contextually appropriate responses.

**Direct Preference Optimization (DPO):** The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. The original outputs (rejected) and the corrected outputs (chosen) form the preference pairs. This fine-tuning targets essential improvements to model outputs without altering the model's core response characteristics.
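To make the DPO data-construction step concrete, the sketch below shows one way such minimal-edit preference pairs could be assembled. It is illustrative only, not Nexa AI's actual pipeline: the `build_dpo_pairs` helper, the use of lexical similarity from `difflib` as a stand-in for the semantic-similarity check, and the `0.8` threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical threshold: the card only says corrections keep "high semantic
# similarity" with the base model's responses; 0.8 is an illustrative value.
MIN_SIMILARITY = 0.8


def build_dpo_pairs(samples, min_similarity=MIN_SIMILARITY):
    """Turn (prompt, base_response, teacher_correction) triples into
    chosen/rejected preference pairs, keeping only minimally edited corrections.

    Lexical similarity via difflib stands in for the semantic-similarity check
    described above; a production pipeline would more likely compare embeddings.
    """
    pairs = []
    for prompt, base_response, teacher_correction in samples:
        similarity = SequenceMatcher(None, base_response, teacher_correction).ratio()
        if similarity < min_similarity:
            continue  # the correction drifted too far from the original response
        pairs.append({
            "prompt": prompt,              # image reference plus question
            "chosen": teacher_correction,  # minimally edited, accuracy-corrected
            "rejected": base_response,     # raw base-model output
        })
    return pairs


if __name__ == "__main__":
    demo = [(
        "<image> What kind of cat is this?",
        "This is a Siamese cat sitting on a red sofa.",
        "This is a tabby cat sitting on a red sofa.",
    )]
    print(build_dpo_pairs(demo))
```

Preference-optimization libraries typically consume records in exactly this prompt/chosen/rejected format.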
### Learn more in our blogs

### Join Discord Community: [https://discord.gg/nexa-ai](https://discord.gg/nexa-ai)

### Website: nexa.ai
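### Appendix: Token Compression Sketch

To make the 9x token compression described under Model Architecture and Technical Innovations concrete, here is a minimal, illustrative PyTorch sketch. It is not Nexa AI's released implementation: the 3×3 spatial grouping, the 1152-dimensional SigLIP embedding width, the 896-dimensional Qwen2.5-0.5B hidden size, and the two-layer MLP are assumptions chosen only to match the 729 → 81 token counts stated in this card.

```python
import torch
import torch.nn as nn

# Assumed dimensions, for illustration only: a 27x27 grid of SigLIP patch
# embeddings (729 tokens), an embedding width of 1152, and a language-model
# hidden size of 896 (Qwen2.5-0.5B). Omni-Vision's real projector may differ.
GRID, GROUP = 27, 3                  # 27x27 patches grouped into 3x3 blocks
VISION_DIM, LM_DIM = 1152, 896


class CompressingProjector(nn.Module):
    """Concatenate each 3x3 neighborhood of patch embeddings (729 -> 81 tokens)
    and project the result into the language model's embedding space."""

    def __init__(self) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(GROUP * GROUP * VISION_DIM, LM_DIM),
            nn.GELU(),
            nn.Linear(LM_DIM, LM_DIM),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        b = patches.shape[0]
        # (B, 729, D) -> (B, 27, 27, D) -> (B, 9, 3, 9, 3, D)
        x = patches.view(b, GRID, GRID, VISION_DIM)
        x = x.view(b, GRID // GROUP, GROUP, GRID // GROUP, GROUP, VISION_DIM)
        # Gather each 3x3 block's nine embeddings into one long vector ...
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, GROUP * GROUP * VISION_DIM)
        # ... and project it into the language model's token space.
        return self.mlp(x)  # (B, 81, LM_DIM): 9x fewer image tokens


if __name__ == "__main__":
    dummy = torch.randn(2, GRID * GRID, VISION_DIM)   # stand-in SigLIP output
    print(CompressingProjector()(dummy).shape)        # torch.Size([2, 81, 896])
```

In this sketch the compression comes entirely from concatenating each 3×3 neighborhood of patch embeddings before projection; any mapping from 729 vision embeddings to 81 language-model tokens would fill the same role in the architecture.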