OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
Abstract
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., natural language supervision alone is sub-optimal for the MLLM's visual understanding ability. To that end, we propose OLA-VLM, the first approach to distill knowledge into the LLM's hidden representations from a set of target visual representations. First, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Second, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Moreover, upon probing OLA-VLM, we observe improved representation quality owing to the embedding optimization. Third, we demonstrate that OLA-VLM outperforms both single- and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. In particular, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM .
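To make the coupled objective concrete, below is a minimal PyTorch sketch (not the official implementation, which is in the linked repository): a hypothetical `EmbeddingProbe` projects an intermediate LLM layer's hidden states into a target visual encoder's embedding space, and the training loss adds an embedding-prediction term to the usual next-text-token cross-entropy. The names, the smooth-L1 choice, and the `alpha` weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingProbe(nn.Module):
    """Hypothetical probe head mapping an intermediate LLM layer's hidden
    states into a target visual encoder's embedding space (e.g., depth)."""

    def __init__(self, llm_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, target_dim),
            nn.GELU(),
            nn.Linear(target_dim, target_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, llm_dim) -> (batch, num_tokens, target_dim)
        return self.proj(hidden_states)


def coupled_loss(logits, labels, probe, layer_hidden, target_embeds, alpha=0.5):
    """Coupled objective: next-text-token cross-entropy + embedding prediction.

    logits:        (B, T, vocab)  LLM output logits
    labels:        (B, T)         shifted text labels, -100 on ignored positions
    layer_hidden:  (B, N, D_llm)  hidden states of the probed layer at the
                                  positions where visual embeddings are predicted
    target_embeds: (B, N, D_tgt)  features from a frozen target visual encoder
    alpha:         assumed weight balancing the two terms
    """
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    pred = probe(layer_hidden)
    # Smooth-L1 between predicted and target embeddings; the paper's exact
    # embedding loss (and choice of probed layers) may differ.
    emb = F.smooth_l1_loss(pred, target_embeds)
    return ce + alpha * emb


# Toy usage with random tensors, just to show the shapes involved.
B, T, N, vocab, d_llm, d_tgt = 2, 16, 8, 32000, 4096, 1024
probe = EmbeddingProbe(d_llm, d_tgt)
loss = coupled_loss(
    torch.randn(B, T, vocab),
    torch.randint(0, vocab, (B, T)),
    probe,
    torch.randn(B, N, d_llm),
    torch.randn(B, N, d_tgt),
)
loss.backward()
```

The design intent, as described in the abstract, is that the target visual encoder only supplies supervision for the intermediate representations during pretraining; presumably the probe is auxiliary and the MLLM's inference path is unchanged.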
Community
GitHub Repo: https://github.com/SHI-Labs/OLA-VLM; Project Page: https://praeclarumjj3.github.io/ola_vlm/
Related papers recommended by the Semantic Scholar API:
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities (2024)
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (2024)
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs (2024)
- MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding (2024)
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training (2024)
- Maya: An Instruction Finetuned Multilingual Multimodal Model (2024)
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition (2024)