MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Abstract
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision-language models (VLMs) and open-domain images, together with a massive synthetic dataset generated by this method. Our empirical analysis shows that MegaPairs produces high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, because MegaPairs relies solely on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes on this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance gains with additional downstream fine-tuning. Our dataset, trained models, and data synthesis pipeline will be made publicly available to facilitate future development in this field.
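The abstract describes the synthesis recipe only at a high level (mine related open-domain images, then have a VLM write the relational text). Below is a minimal sketch of what such a triplet-generation loop could look like, assuming triplets take the common composed-image-retrieval form (query image, instruction, target image); `embed_fn` and `vlm_fn` are caller-supplied stand-ins for an image encoder and an open-source VLM, and none of these names or prompts come from the paper.

```python
# Hedged sketch of a MegaPairs-style synthesis loop (illustrative only).
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence

import numpy as np


@dataclass
class Triplet:
    query_image: str   # path or URL of the query image
    instruction: str   # VLM-written text relating the query image to the target
    target_image: str  # path or URL of the target image


def mine_image_pairs(
    images: Sequence[str],
    embed_fn: Callable[[str], np.ndarray],
    top_k: int = 3,
) -> Iterator[tuple[str, str]]:
    """Pair each image with its most similar images in the corpus."""
    embs = np.stack([embed_fn(img) for img in images]).astype(np.float32)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ embs.T
    np.fill_diagonal(sims, -np.inf)  # never pair an image with itself
    for i, img in enumerate(images):
        for j in np.argsort(-sims[i])[:top_k]:
            yield img, images[int(j)]


def synthesize_triplets(
    images: Sequence[str],
    embed_fn: Callable[[str], np.ndarray],
    vlm_fn: Callable[[str, str, str], str],
) -> Iterator[Triplet]:
    """Yield (query image, instruction, target image) training triplets."""
    prompt = (
        "Describe how to modify the first image to obtain the second one, "
        "phrased as a short retrieval instruction."
    )
    for query_img, target_img in mine_image_pairs(images, embed_fn):
        instruction = vlm_fn(query_img, target_img, prompt)
        yield Triplet(query_img, instruction, target_img)
```

Because every component is an open-source model or a general image corpus, a loop of this shape can be scaled simply by enlarging the image pool, which is the scalability property the abstract emphasizes.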
Community
We introduce MegaPairs, a dataset of 26 million multimodal triplets built from open-domain images. MegaPairs significantly advances universal multimodal retrieval: models trained on it achieve state-of-the-art results on composed image retrieval (CIR) and the MMEB benchmark. All code, trained models, and datasets will be made publicly available soon!
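At inference time, a retriever trained on such triplets is typically queried with an (image, instruction) pair and matched against candidate image embeddings by cosine similarity. The snippet below is a hedged sketch of that usage pattern; `encode_composed` and `encode_image` are assumed interfaces, not the released models' actual API.

```python
# Illustrative composed-query retrieval with a trained multimodal retriever.
import numpy as np


def retrieve(
    query_image: str,
    instruction: str,
    candidate_images: list[str],
    encode_composed,   # (image, text) -> np.ndarray, assumed interface
    encode_image,      # image -> np.ndarray, assumed interface
    top_k: int = 10,
) -> list[str]:
    """Rank candidate images by cosine similarity to the composed query."""
    q = encode_composed(query_image, instruction)
    q = q / np.linalg.norm(q)
    cands = np.stack([encode_image(img) for img in candidate_images])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ q
    order = np.argsort(-scores)[:top_k]
    return [candidate_images[int(i)] for i in order]
```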
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant (2024)
- EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval (2024)
- Compositional Image Retrieval via Instruction-Aware Contrastive Learning (2024)
- CompCap: Improving Multimodal Large Language Models with Composite Captions (2024)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale (2024)
- Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis (2024)
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions (2024)