alanzhuly committed on
Commit ab88027
1 Parent(s): c9118f4

Update README.md

Files changed (1)
  1. README.md +12 -7
README.md CHANGED
@@ -17,14 +17,14 @@ Omni-Vision is a compact multimodal model that processes both visual and text in
  - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
  The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
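To make the projection step concrete, here is a minimal sketch of such an MLP projector. The shapes are illustrative assumptions (a 1152-dim vision embedding, Qwen2.5-0.5B's 896-dim hidden size), and `VisionProjector` is a hypothetical name, not code from this repository:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Hypothetical MLP projector: maps vision-encoder patch embeddings
    into the language model's token-embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        # Linear -> GELU -> Linear is a common projector design in
        # vision-language models; the exact layout here is assumed.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(patch_embeddings)

# Dummy batch: 729 patch embeddings from the vision encoder.
patches = torch.randn(1, 729, 1152)
projected = VisionProjector()(patches)  # -> shape (1, 729, 896)
```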
 
- **Feedback:** Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama-models/tree/main/models/llama3_2). For more technical information about generation parameters and recipes for how to use Llama 3.2-Vision in applications, please go [here](https://github.com/meta-llama/llama-recipes).
+ **Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
 
  ## Intended Use Cases
 
- 1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it.
- 2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story.
+ 1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it. E.g. "What kind of cat is this?"
+ 2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story. E.g. "Describe this image."
 
- ## ## Benchmarks
+ ## Benchmarks
 
  | Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
  |-------------------|----------------------|-----------|-------------|
@@ -36,22 +36,27 @@ The vision encoder first transforms input images into embeddings, which are then
  | ScienceQA (Test) | 64.5 | 59.0 | NA |
  | POPE | 89.4 | 84.1 | NA |
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztPlo5TgBAsFvZpGMy9H.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztP1o5TgBAsFVzpGMy9H.png" alt="Benchmark Radar Chart" width="500"/>
 
  ## How to use
 
- This repository contains two versions of Llama-3.2-11B-Vision-Instruct, for use with transformers and with the original `llama` codebase.
-
  **Test in HuggingFace Space**
 
  **Run Locally**
 
  Install Nexa-SDK
 
+ Install Nexa-SDK:
+
+ > **Note**: To run our models locally, you'll need to install Nexa-SDK. It's an on-device inference framework that enables efficient and flexible deployment of our models directly on your hardware. With support for text, image, audio, and multimodal processing, Nexa-SDK brings powerful AI capabilities to your local environment.
+
  ```bash
  nexa run omnivision
  ```
 
+ ## Technical Innovations for Edge Deployment
+ - 9x Token Reduction through Token Compression
+ - Minimal-Edit DPO for Enhanced Response Quality
 
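One plausible reading of the 9x token-reduction bullet above, sketched with assumed shapes (729 patch embeddings folded into 81 wider ones); the model's actual compression may group tokens differently or learn the mapping:

```python
import torch

def compress_tokens(patch_embeddings: torch.Tensor, group: int = 9) -> torch.Tensor:
    """Illustrative 9x token compression: fold each run of `group`
    consecutive patch embeddings into one wider embedding, cutting the
    sequence length the language model must attend over."""
    batch, num_tokens, dim = patch_embeddings.shape
    assert num_tokens % group == 0, "token count must be divisible by group"
    # (batch, 729, dim) -> (batch, 81, dim * 9)
    return patch_embeddings.reshape(batch, num_tokens // group, dim * group)

compressed = compress_tokens(torch.randn(1, 729, 1152))
print(compressed.shape)  # torch.Size([1, 81, 10368])
```

A compressed token would then pass through the projection MLP, so the language model attends over 81 image tokens instead of 729.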
  ## Training