kaizuberbuehler's Collections • Benchmarks
GAIA: a benchmark for General AI Assistants • arXiv:2311.12983
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI • arXiv:2311.16502
BLINK: Multimodal Large Language Models Can See but Not Perceive • arXiv:2404.12390
RULER: What's the Real Context Size of Your Long-Context Language Models? • arXiv:2404.06654
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues • arXiv:2404.03820
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models • arXiv:2404.03543
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings • arXiv:2404.16820
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension • arXiv:2404.16790
On the Planning Abilities of Large Language Models -- A Critical Investigation • arXiv:2305.15771
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis • arXiv:2405.21075
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning • arXiv:2406.09170
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding • arXiv:2406.09411
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery • arXiv:2406.08587
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs • arXiv:2406.11833
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation • arXiv:2406.09961
Needle In A Multimodal Haystack • arXiv:2406.07230
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack • arXiv:2406.10149
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains • arXiv:2407.18961
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents • arXiv:2407.18901
WebArena: A Realistic Web Environment for Building Autonomous Agents • arXiv:2307.13854
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI • arXiv:2408.03361
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java • arXiv:2408.14354
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments • arXiv:2405.07960
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning • arXiv:2310.16049
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines • arXiv:2409.12959
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? • arXiv:2409.07703
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models • arXiv:2409.16191
OmniBench: Towards The Future of Universal Omni-Language Models • arXiv:2409.15272
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models • arXiv:2410.07985
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • arXiv:2404.07972
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks • arXiv:2412.15204
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks • arXiv:2412.14161
Are Your LLMs Capable of Stable Reasoning? • arXiv:2412.13147
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • arXiv:2412.08737
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings • arXiv:2501.01257
A3: Android Agent Arena for Mobile GUI Agents • arXiv:2501.01149
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation • arXiv:2412.21199
ResearchTown: Simulator of Human Research Community • arXiv:2412.17767
Agent-SafetyBench: Evaluating the Safety of LLM Agents • arXiv:2412.14470
The BrowserGym Ecosystem for Web Agent Research • arXiv:2412.05467
Evaluating Language Models as Synthetic Data Generators • arXiv:2412.03679
arXiv:2412.04315
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs • arXiv:2412.03205
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework • arXiv:2411.06176
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models • arXiv:2411.04075
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond • arXiv:2411.03590