---
title: README
emoji: π
colorFrom: blue
colorTo: pink
sdk: static
pinned: false
---
## Consensus-driven LLM Evaluation
The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. To address the challenge of ranking LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness, the **Language Model Council (LMC)** operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
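
The three stages can be pictured as nested loops over the council members. Below is a minimal Python sketch of that idea; the `query` helper, the member names, and the 1-to-10 scoring prompt are illustrative placeholders, not the project's actual API.

```python
# Minimal sketch of the three council stages. `query` is a hypothetical
# placeholder for an LLM API call; swap in a real client in practice.
import random
from statistics import mean

council = ["model_a", "model_b", "model_c"]  # placeholder member names

def query(model: str, prompt: str) -> str:
    # Placeholder: a real implementation would call the model's API here.
    return f"[{model} answer to: {prompt[:30]}...]"

# 1) Formulate the test set: every member contributes prompts equally.
test_set = [query(m, "Propose one interpersonal dilemma.") for m in council]

# 2) Administer the test: every member answers every prompt.
responses = {m: [query(m, dilemma) for dilemma in test_set] for m in council}

# 3) Evaluate as a collective jury: every member scores every response,
#    and scores are averaged so no single judge dominates the ranking.
def judge(judge_model: str, dilemma: str, answer: str) -> float:
    _ = query(judge_model, f"Rate this answer to '{dilemma}' from 1 to 10:\n{answer}")
    return random.uniform(1, 10)  # placeholder: parse the judge's reply in practice

leaderboard = {
    member: mean(
        judge(j, dilemma, answer)
        for j in council
        for dilemma, answer in zip(test_set, responses[member])
    )
    for member in council
}
print(sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True))
```

Averaging scores across every juror is what keeps any single judge's bias from dominating the final ranking.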
Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, more robust, and less biased than those from any individual LLM judge, and that its leaderboard is more consistent with a human-established ranking than other benchmarks such as Chatbot Arena or MMLU.

Roadmap:
- Use the Council to benchmark evaluative characteristics of LLM-as-a-Judge/Jury like bias, affinity, and agreement. | |
- Expand to more domains, use cases, and sophisticated agentic interactions. | |
- Produce a generalized user interface for Council-as-a-Service. | |