CompCap: Improving Multimodal Large Language Models with Composite Captions
Abstract
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information from or perform complex reasoning over these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are available only for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K with supervised fine-tuning of MLLMs at three scales: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
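The abstract does not spell out the synthesis pipeline, but the recipe it describes (compose known visual elements programmatically, then derive the caption from the same source metadata so it is accurate by construction) can be illustrated with a minimal, hypothetical Python sketch. The chart data, function names, and template-based caption below are illustrative assumptions, not the paper's implementation; CompCap itself relies on LLMs and automation tools to produce richer captions across six CI types.

```python
# Minimal sketch (not the CompCap pipeline): render two charts from known
# metadata, merge them into one composite image, and build a caption from
# that same metadata so the caption is accurate by construction.
import io
from PIL import Image
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def render_bar_chart(title, labels, values):
    """Render a single bar chart and return it as a PIL image."""
    fig, ax = plt.subplots(figsize=(3, 3), dpi=100)
    ax.bar(labels, values, color="steelblue")
    ax.set_title(title)
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def compose_side_by_side(images, pad=10):
    """Paste chart panels onto one white canvas to form a composite image."""
    width = sum(im.width for im in images) + pad * (len(images) + 1)
    height = max(im.height for im in images) + 2 * pad
    canvas = Image.new("RGB", (width, height), "white")
    x = pad
    for im in images:
        canvas.paste(im, (x, pad))
        x += im.width + pad
    return canvas

# Known metadata drives both the rendering and the caption (toy example).
charts = [
    ("2023 sales by region", ["NA", "EU", "APAC"], [42, 35, 23]),
    ("2024 sales by region", ["NA", "EU", "APAC"], [45, 33, 30]),
]
panels = [render_bar_chart(t, l, v) for t, l, v in charts]
composite = compose_side_by_side(panels)
composite.save("composite_chart.png")

# Template-based caption; the paper instead uses LLMs for detailed captions.
caption = " ".join(
    f"The {'left' if i == 0 else 'right'} panel, titled '{title}', shows "
    + ", ".join(f"{l}: {v}" for l, v in zip(labels, values)) + "."
    for i, (title, labels, values) in enumerate(charts)
)
print(caption)
```

Because the caption is generated from the same metadata used to draw the panels, it cannot hallucinate values, which is the key property the paper's caption-synthesis approach aims for.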
Community
Image-caption dataset for composite images
The dataset is under preparation and will be released later this month.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions (2024)
- Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model (2024)
- Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis (2024)
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives (2024)
- VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (2024)
- Personalized Multimodal Large Language Models: A Survey (2024)
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2024)