NexaAIDev/OmniVLM-968M

Image-Text-to-Text


alexchen4ai committed on Nov 21, 2024

Commit 84dd548 • 1 Parent(s): 509316c

Update README.md

Files changed (1)

README.md +1 -1

@@ -12,7 +12,7 @@ tags:
 Omnivision is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:
-- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
+- **9x Token Reduction**: Reduces image tokens from **729** to **81**, cutting latency and computational cost aggressively. Note that the computation of vision encoder and the projection part keep the same, but the computation of language model backbone is reduced due to 9X shorter image token span.
 - **Trustworthy Result**: Reduces hallucinations using **DPO** training from trustworthy data.
 **Quick Links:**
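
The note in the added line can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the 729 and 81 token counts come from the README, while the function name, the `text_tokens` parameter, and the simplified cost model (per-token work scaling linearly and self-attention scaling quadratically with sequence length) are assumptions introduced here, not figures reported by the model authors. Vision encoder and projection costs are left out because the README says they stay the same.

```python
def image_token_savings(tokens_before=729, tokens_after=81, text_tokens=0):
    """Estimate how much shorter the language model's input becomes.

    tokens_before / tokens_after come from the README (729 -> 81).
    text_tokens is a hypothetical prompt length added for illustration.
    """
    # Sequence length seen by the language model backbone in each case.
    seq_before = tokens_before + text_tokens
    seq_after = tokens_after + text_tokens

    return {
        # Reduction in image tokens alone (the "9x" in the README).
        "image_token_reduction": tokens_before / tokens_after,
        # Assumed linear cost model: work proportional to sequence length.
        "linear_cost_ratio": seq_before / seq_after,
        # Assumed quadratic cost model: self-attention over the sequence.
        "attention_cost_ratio": (seq_before ** 2) / (seq_after ** 2),
    }


if __name__ == "__main__":
    # Image tokens only, then with a hypothetical 50-token text prompt.
    print(image_token_savings())
    print(image_token_savings(text_tokens=50))
```

With a non-zero `text_tokens`, the end-to-end ratios fall below 9x, since the text portion of the sequence is not shortened.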