MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Abstract
Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is more useful than textual knowledge, for instance, additional images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images than with textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.
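As a rough illustration of how a multiple-choice, retrieval-augmented evaluation of this kind could be scored, the sketch below assembles a prompt from a question, its answer choices, and retrieved images, then computes accuracy against gold answers. It is a minimal sketch only: the names (`MCQExample`, `build_prompt`, `accuracy`) and the example data are hypothetical and do not come from the paper's released code or dataset schema.

```python
from dataclasses import dataclass
from typing import List

# Illustrative scoring loop for a retrieval-augmented multiple-choice benchmark.
# Names and fields are assumptions, not the MRAG-Bench release format.

@dataclass
class MCQExample:
    question: str
    choices: List[str]           # e.g. ["A. ...", "B. ...", "C. ...", "D. ..."]
    answer: str                  # gold choice letter, e.g. "A"
    retrieved_images: List[str]  # paths to retrieved visual knowledge


def build_prompt(example: MCQExample) -> str:
    """Compose the text part of the prompt; retrieved images would be
    passed to the LVLM separately as visual inputs."""
    choices = "\n".join(example.choices)
    return (
        f"Question: {example.question}\n"
        f"{choices}\n"
        "Answer with the letter of the correct choice."
    )


def accuracy(predictions: List[str], examples: List[MCQExample]) -> float:
    """Fraction of examples whose predicted letter matches the gold answer."""
    if not examples:
        return 0.0
    correct = sum(
        pred.strip().upper().startswith(ex.answer)
        for pred, ex in zip(predictions, examples)
    )
    return correct / len(examples)


if __name__ == "__main__":
    ex = MCQExample(
        question="Which animal is shown from this unusual viewpoint?",
        choices=["A. Okapi", "B. Zebra", "C. Giraffe", "D. Tapir"],
        answer="A",
        retrieved_images=["retrieved/okapi_side.jpg", "retrieved/okapi_rear.jpg"],
    )
    print(build_prompt(ex))
    # Collect model outputs under image-augmented vs. text-augmented settings,
    # then compare accuracy() across the two conditions.
    print(accuracy(["A"], [ex]))
```

In this setup, the vision-centric comparison reported in the abstract amounts to running the same scorer twice, once with image augmentation and once with textual augmentation, and comparing the resulting accuracies.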