JuStRank: Benchmarking LLM Judges for System Ranking
Abstract
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while remaining agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and judge quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including decisiveness and bias.
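The evaluation protocol the abstract describes is simple to sketch. The snippet below is a minimal illustration in Python, assuming mean aggregation of per-response judge scores and Kendall's tau as the ranking-agreement metric; both choices, and the example data, are hypothetical placeholders rather than the paper's exact setup.

```python
# Minimal sketch of judge-as-system-ranker evaluation, per the abstract.
# Mean aggregation and Kendall's tau are illustrative assumptions, not
# necessarily the paper's exact choices.
from statistics import mean
from scipy.stats import kendalltau

# Hypothetical per-response judge scores, keyed by source system.
judge_scores = {
    "system_a": [0.9, 0.8, 0.85],
    "system_b": [0.6, 0.7, 0.65],
    "system_c": [0.75, 0.8, 0.7],
}

# Aggregate per-response judgments into a single score per system.
system_scores = {sys: mean(scores) for sys, scores in judge_scores.items()}

# Hypothetical human-derived system scores (e.g., from a human-voted leaderboard).
human_scores = {"system_a": 0.95, "system_b": 0.55, "system_c": 0.7}

# Judge quality = rank agreement between the judge-induced and human rankings.
systems = sorted(judge_scores)
tau, p_value = kendalltau(
    [system_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Kendall's tau between judge and human rankings: {tau:.2f}")
```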
Community
The Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- JudgeBench: A Benchmark for Evaluating LLM-based Judges (2024)
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge (2024)
- Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data (2024)
- Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models (2024)
- Diverging Preferences: When do Annotators Disagree and do Models Know? (2024)
- MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems (2024)
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution (2024)