alexchen4ai committed
Commit 93c4844
1 Parent(s): 0d8b9dd

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -13,7 +13,7 @@ tags:
  Omnivision is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:
 
  - **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- - **Trustworthy result**: Reduces hallucinations using **DPO** training from trustworthy data.
+ - **Trustworthy Result**: Reduces hallucinations using **DPO** training from trustworthy data.
 
  **Quick Links:**
  1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
@@ -70,7 +70,7 @@ Omnivision's architecture consists of three key components:
 
  - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
  - Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
+ - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.
 
  The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
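
As a rough illustration of how a 729 → 81 reduction could be wired up, below is a minimal sketch of a grouping MLP projector. It is not the released implementation: the 3×3 spatial grouping (a 27×27 grid of SigLIP patch tokens merged into 9×9 cells), the SigLIP-400M embedding width of 1152, and the Qwen2.5-0.5B hidden size of 896 are assumptions inferred from the numbers quoted in the diff, so treat the class name and dimensions as illustrative only.

```python
import torch
import torch.nn as nn


class TokenReductionProjector(nn.Module):
    """Hypothetical projector sketch: merges each 3x3 neighborhood of vision
    tokens (27x27 = 729 -> 9x9 = 81) and maps the result into the LLM's
    token space with a small MLP. Dimensions are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896, group: int = 3):
        super().__init__()
        self.group = group
        # Each output token is projected from group*group concatenated vision embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, 729, vision_dim), e.g. SigLIP patch embeddings
        b, n, d = vision_tokens.shape
        side = int(n ** 0.5)          # 27
        g = self.group                # 3
        x = vision_tokens.view(b, side, side, d)
        # Split the 27x27 grid into 9x9 cells of 3x3 patches and concatenate
        # the 9 patch embeddings inside each cell along the feature axis.
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)            # (batch, 81, llm_dim)


# 729 vision tokens in, 81 language-model-space tokens out
projector = TokenReductionProjector()
tokens = projector(torch.randn(2, 729, 1152))
print(tokens.shape)  # torch.Size([2, 81, 896])
```

Whatever the exact mechanism used in the model, the design point this sketch captures is that the projector trades a wider per-token MLP input for 9x fewer tokens entering the language model, which is where the latency and compute savings claimed above come from.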