kaizuberbuehler's Collections • Benchmarks
GAIA: a benchmark for General AI Assistants • arXiv:2311.12983
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI • arXiv:2311.16502
BLINK: Multimodal Large Language Models Can See but Not Perceive • arXiv:2404.12390
RULER: What's the Real Context Size of Your Long-Context Language Models? • arXiv:2404.06654
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues • arXiv:2404.03820
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models • arXiv:2404.03543
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings • arXiv:2404.16820
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension • arXiv:2404.16790
On the Planning Abilities of Large Language Models -- A Critical Investigation • arXiv:2305.15771
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis • arXiv:2405.21075
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning • arXiv:2406.09170
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding • arXiv:2406.09411
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery • arXiv:2406.08587
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs • arXiv:2406.11833
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation • arXiv:2406.09961
Needle In A Multimodal Haystack • arXiv:2406.07230
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack • arXiv:2406.10149
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains • arXiv:2407.18961
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents • arXiv:2407.18901
WebArena: A Realistic Web Environment for Building Autonomous Agents • arXiv:2307.13854
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI • arXiv:2408.03361
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java • arXiv:2408.14354
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments • arXiv:2405.07960
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning • arXiv:2310.16049
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines • arXiv:2409.12959
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? • arXiv:2409.07703
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models • arXiv:2409.16191
OmniBench: Towards The Future of Universal Omni-Language Models • arXiv:2409.15272
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models • arXiv:2410.07985
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • arXiv:2404.07972
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks • arXiv:2412.15204
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks • arXiv:2412.14161
Are Your LLMs Capable of Stable Reasoning? • arXiv:2412.13147
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • arXiv:2412.08737
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings • arXiv:2501.01257
A3: Android Agent Arena for Mobile GUI Agents • arXiv:2501.01149
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation • arXiv:2412.21199
ResearchTown: Simulator of Human Research Community • arXiv:2412.17767
Agent-SafetyBench: Evaluating the Safety of LLM Agents • arXiv:2412.14470
The BrowserGym Ecosystem for Web Agent Research • arXiv:2412.05467
Evaluating Language Models as Synthetic Data Generators • arXiv:2412.03679
arXiv:2412.04315
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs • arXiv:2412.03205
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework • arXiv:2411.06176
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models • arXiv:2411.04075
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond • arXiv:2411.03590