alexchen4ai committed 93c4844 (parent: 0d8b9dd): Update README.md

README.md CHANGED
@@ -13,7 +13,7 @@ tags:
 
 Omnivision is a compact, sub-billion-parameter (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improving on LLaVA's architecture, it features:
 
 - **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
-- **Trustworthy
+- **Trustworthy Results**: Reduces hallucinations through **DPO** training on trustworthy data.
 
 **Quick Links:**
 1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).

@@ -70,7 +70,7 @@ Omnivision's architecture consists of three key components:
 
 - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
 - Vision Encoder: SigLIP-400M operates at 384 resolution with a 14×14 patch size to generate image embeddings
-- Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
+- Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.
 
 The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
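For a concrete picture of the 9x token reduction described in the diff above: 729 = 27×27 SigLIP patch embeddings can be merged 3×3 into 81 = 9×9 tokens before the MLP projection. The sketch below is a minimal, hypothetical design under stated assumptions (SigLIP-400M hidden size 1152, Qwen2.5-0.5B hidden size 896); the README does not disclose Omnivision's actual projector internals, and `TokenReducingProjector` is an illustrative name.

```python
import torch
import torch.nn as nn

class TokenReducingProjector(nn.Module):
    """Illustrative 9x token-reduction projector (hypothetical design).

    Merges each 3x3 neighborhood of vision patches into one token
    (27x27 = 729 patches -> 9x9 = 81 tokens), then maps the merged
    features into the language model's embedding space with an MLP.
    """

    def __init__(self, vision_dim=1152, lm_dim=896, grid=27, merge=3):
        super().__init__()
        self.grid, self.merge = grid, merge
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):  # x: (B, 729, vision_dim)
        b, n, c = x.shape
        g, m = self.grid, self.merge
        # Split the 27x27 grid into 9x9 blocks of 3x3 patches each.
        x = x.view(b, g // m, m, g // m, m, c).permute(0, 1, 3, 2, 4, 5)
        # Concatenate the 9 patch embeddings of each block into one vector.
        x = x.reshape(b, (g // m) ** 2, m * m * c)
        return self.mlp(x)  # (B, 81, lm_dim)

proj = TokenReducingProjector()
image_tokens = torch.randn(1, 729, 1152)  # SigLIP-400M @ 384, 14x14 patches
print(proj(image_tokens).shape)           # torch.Size([1, 81, 896])
```

With 81 instead of 729 image tokens, the language model's prefill work on image context drops roughly 9x, which is where the latency and compute savings claimed above come from.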
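On the **DPO** training behind the trustworthiness bullet: Direct Preference Optimization fine-tunes the model to prefer a trustworthy answer over a hallucinated one, relative to a frozen reference model. Below is a minimal sketch of the standard DPO loss; the README does not detail Nexa's actual training recipe, so `beta` and the per-sequence log-probabilities are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the 'chosen'
    (trustworthy) response over the 'rejected' (hallucinated) one,
    measured relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with per-sequence summed log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # scalar preference loss
```

Here `beta` scales the implicit reward margin: larger values penalize the policy more sharply for ranking the rejected response above the chosen one relative to the reference model.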
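Putting the three components together, the closing paragraph describes a standard LLaVA-style forward pass: encode the image, project the embeddings into the LM's token space, and prepend them to the text embeddings. The sketch below is hypothetical glue code using Hugging Face `transformers` components, not Nexa's implementation; it reuses the `TokenReducingProjector` sketched earlier and would need the trained Omnivision weights to produce meaningful output.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SiglipVisionModel

# Off-the-shelf stand-ins for the two pretrained components named above.
vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
proj = TokenReducingProjector(vision_dim=1152, lm_dim=lm.config.hidden_size)

pixel_values = torch.randn(1, 3, 384, 384)             # stand-in for a preprocessed image
patch_embeds = vision(pixel_values).last_hidden_state  # (1, 729, 1152)
image_tokens = proj(patch_embeds)                      # (1, 81, 896)

text = tok("Describe this image.", return_tensors="pt")
text_embeds = lm.get_input_embeddings()(text.input_ids)  # (1, T, 896)

# Image tokens go in front of the text tokens; the LM then runs end to end.
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
logits = lm(inputs_embeds=inputs_embeds).logits
print(logits.shape)  # (1, 81 + T, vocab_size)
```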