Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Abstract
Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Encouragingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
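To make "explicit visual prompting for instance guidance" concrete, here is a minimal, hypothetical sketch (not the authors' released pipeline): it overlays numbered markers on an image at given instance bounding boxes and asks GPT-4o to describe each marked instance. The `INSTANCE_BOXES` values, the prompt wording, and the marker style are illustrative assumptions; the boxes are presumed to come from an upstream detector or tracker.

```python
# Hypothetical sketch of an explicit-visual-prompt annotation step (not the
# official Inst-IT pipeline): overlay numbered instance markers on an image,
# then ask GPT-4o to describe each marked instance individually.
import base64
from io import BytesIO

from openai import OpenAI
from PIL import Image, ImageDraw

# Illustrative instance boxes: {instance_id: (x1, y1, x2, y2)} -- assumed to be
# produced upstream, e.g. by an off-the-shelf detector or tracker.
INSTANCE_BOXES = {1: (40, 60, 180, 300), 2: (220, 80, 360, 310)}


def overlay_instance_markers(image_path: str, boxes: dict) -> Image.Image:
    """Draw a box and a numbered marker for each instance."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for inst_id, (x1, y1, x2, y2) in boxes.items():
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.ellipse((x1, y1, x1 + 28, y1 + 28), fill="red")
        draw.text((x1 + 10, y1 + 6), str(inst_id), fill="white")
    return img


def describe_marked_instances(img: Image.Image) -> str:
    """Send the marked image to GPT-4o and request per-instance descriptions."""
    buf = BytesIO()
    img.save(buf, format="JPEG")
    data_url = "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Each object of interest is marked with a numbered red "
                         "label. Describe every marked instance individually."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    marked = overlay_instance_markers("frame_0001.jpg", INSTANCE_BOXES)
    print(describe_marked_instances(marked))
```

For video, the same marking-and-querying step would be repeated per frame (or on a sampled set of frames) so that the model's answers can be aligned across time for each instance ID.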
Community
- Project page: https://inst-it.github.io
- Github: https://github.com/inst-it/inst-it
- ArXiv: https://arxiv.org/abs/2412.03565
- Data and checkpoints: https://huggingface.co/Inst-IT
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (2024)
- Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning (2024)
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models (2024)
- VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? (2024)
- HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding (2024)
- SparrowVQE: Visual Question Explanation for Course Content Understanding (2024)
- VQA$^2$: Visual Question Answering for Video Quality Assessment (2024)