AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Abstract
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer answers correctly, models must effectively leverage clues from both the visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, all questions are structured as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.
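As a minimal illustration of the two ideas in the abstract (this is a sketch, not the authors' released code), the snippet below generates a DeafTest-style loudness-comparison item as two sine tones and scores a free-form model reply by objective option-letter matching. The file names, tone parameters, and the letter-extraction regex are assumptions made for illustration only.

```python
# Illustrative sketch only (not the AV-Odyssey release): build a DeafTest-style
# loudness-comparison item as two sine tones, then score a model's free-form
# reply by matching the chosen option letter. All names/values are assumed.
import re
import wave

import numpy as np

SAMPLE_RATE = 16000  # Hz, assumed


def write_tone(path, freq_hz, amplitude, duration_s=2.0):
    """Write a mono 16-bit PCM sine tone to `path`."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    signal = amplitude * np.sin(2 * np.pi * freq_hz * t)
    pcm = (np.clip(signal, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm.tobytes())


# Two tones at the same pitch but clearly different loudness:
# for "Which sound is louder?" the ground-truth answer is the first one (A).
write_tone("sound_a.wav", freq_hz=440, amplitude=0.8)
write_tone("sound_b.wav", freq_hz=440, amplitude=0.2)


def score_choice(model_reply, correct_letter):
    """Objective multiple-choice scoring: extract the first standalone option
    letter (A-D) from the reply and compare it with the ground truth."""
    match = re.search(r"\b([A-D])\b", model_reply.strip().upper())
    return match is not None and match.group(1) == correct_letter


print(score_choice("The answer is A, the first sound is louder.", "A"))  # True
```

Because every question ships with a fixed set of option letters, accuracy can be computed by this kind of string matching alone, which is what removes the need for human or LLM-assisted judging.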
Community
Project Page: https://av-odyssey.github.io/
Huggingface Dataset (AV-Odyssey): https://huggingface.co/datasets/AV-Odyssey/AV_Odyssey_Bench
Huggingface Dataset (DeafTest): https://huggingface.co/datasets/AV-Odyssey/Deaftest_dataset
Huggingface Leaderboard: https://huggingface.co/spaces/AV-Odyssey/AV_Odyssey_Bench_Leaderboard
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024)
- AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models (2024)
- MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark (2024)
- StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding (2024)
- SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context (2024)
- Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination (2024)
- Ocean-omni: To Understand the World with Omni-modality (2024)