arxiv:2410.13754

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Published on Oct 17 · Submitted by jinjieni on Oct 18

#2 Paper of the day

Authors: Jinjie Ni, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Kabir Jain, Michael Shieh

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show that our approach effectively aligns benchmark samples with real-world task distributions and that its model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations, and offer insights to enhance understanding of multi-modal evaluations and inform future research.
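To make the ranking-correlation claim concrete, here is a minimal sketch of how agreement between a benchmark's model ranking and a crowd-sourced ranking could be measured. It assumes Spearman rank correlation; the abstract does not say which coefficient the authors used. The model names and scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores, for illustration only (not results from the paper).
benchmark_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 60.1}
arena_ratings = {"model_a": 1210, "model_b": 1150, "model_c": 1105, "model_d": 1098}

models = sorted(benchmark_scores)
rho = spearman(
    [benchmark_scores[m] for m in models],
    [arena_ratings[m] for m in models],
)
print(f"Spearman correlation: {rho:.2f}")
```

A value near 1 means the two evaluations order the models almost identically. In this toy data the two lowest-ranked models swap places, which lowers the coefficient.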

Community

jinjieni

Paper author · Paper submitter · Oct 18 (edited Oct 18)

MixEval-X is the first any-to-any, real-world benchmark featuring diverse input-output modalities, real-world task distributions, consistent high standards across modalities, and dynamism. It achieves up to 0.98 correlation with arena-like multi-modal evaluations while being way more efficient.