WildBench - an allenai Collection

WildBench

updated Nov 27

AI2 WildBench Leaderboard (V2)

Note: The leaderboard for visualizing the results and collecting human feedback.

allenai/WildBench

Viewer • Updated Nov 4 • 2.3k • 2.36k • 34

Note: Examples for evaluating LLMs.

allenai/WildBench-V2-Model-Outputs

Viewer • Updated Aug 1 • 62.5k • 1.66k • 2

Note: The model outputs for verified LLMs on the leaderboard.

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Paper • 2406.04770 • Published Jun 7 • 27