alexchen4ai committed
Commit 93c4844
1 Parent(s): 0d8b9dd

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -13,7 +13,7 @@ tags:
  Omnivision is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:
 
  - **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- - **Trustworthy result**: Reduces hallucinations using **DPO** training from trustworthy data.
+ - **Trustworthy Result**: Reduces hallucinations using **DPO** training from trustworthy data.
 
  **Quick Links:**
  1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
@@ -70,7 +70,7 @@ Omnivision's architecture consists of three key components:
 
  - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
  - Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
+ - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.
 
  The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
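
As a rough illustration of how a 729 → 81 reduction could be wired up, below is a minimal sketch of a grouping MLP projector. It is not the released implementation: the 3×3 spatial grouping (a 27×27 grid of SigLIP patch tokens merged into 9×9 cells), the SigLIP-400M embedding width of 1152, and the Qwen2.5-0.5B hidden size of 896 are assumptions inferred from the numbers quoted in the diff, so treat the class name and dimensions as illustrative only.

```python
import torch
import torch.nn as nn


class TokenReductionProjector(nn.Module):
    """Hypothetical projector sketch: merges each 3x3 neighborhood of vision
    tokens (27x27 = 729 -> 9x9 = 81) and maps the result into the LLM's
    token space with a small MLP. Dimensions are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896, group: int = 3):
        super().__init__()
        self.group = group
        # Each output token is projected from group*group concatenated vision embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, 729, vision_dim), e.g. SigLIP patch embeddings
        b, n, d = vision_tokens.shape
        side = int(n ** 0.5)          # 27
        g = self.group                # 3
        x = vision_tokens.view(b, side, side, d)
        # Split the 27x27 grid into 9x9 cells of 3x3 patches and concatenate
        # the 9 patch embeddings inside each cell along the feature axis.
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)            # (batch, 81, llm_dim)


# 729 vision tokens in, 81 language-model-space tokens out
projector = TokenReductionProjector()
tokens = projector(torch.randn(2, 729, 1152))
print(tokens.shape)  # torch.Size([2, 81, 896])
```

Whatever the exact mechanism used in the model, the design point this sketch captures is that the projector trades a wider per-token MLP input for 9x fewer tokens entering the language model, which is where the latency and compute savings claimed above come from.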