---
title: README
emoji: π
colorFrom: blue
colorTo: pink
sdk: static
pinned: false
---
## Consensus-driven LLM Evaluation
The rapid advancement of Large Language Models (LLMs) necessitates robust
and challenging benchmarks.
Ranking LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness is especially difficult.
To address this, the **Language Model Council (LMC)** operates through a democratic process to: 1) formulate a test set through
equal participation, 2) administer the test among council members, and 3) evaluate
responses as a collective jury.
Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, more robust,
and less biased than those from any individual LLM judge, and that are more consistent with a human-established leaderboard than other benchmarks such as Chatbot Arena or MMLU.
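
The three-step council process described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `StubMember`, `run_council`, the toy "longer response wins" judging rule, and pooled pairwise voting are hypothetical and are not taken from the LMC implementation. Real council members would be LLM API calls.

```python
from collections import Counter
from itertools import combinations


class StubMember:
    """Hypothetical stand-in for an LLM council member (no real API calls)."""

    def __init__(self, name: str, verbosity: int):
        self.name = name
        self.verbosity = verbosity

    def write_prompt(self) -> str:
        # Step 1: each member contributes one test item.
        return f"Dilemma proposed by {self.name}"

    def respond(self, prompt: str) -> str:
        # Step 2: toy response whose length depends only on verbosity.
        return " ".join(["advice"] * self.verbosity)

    def judge(self, prompt: str, first: str, second: str) -> int:
        # Step 3: toy preference for the longer response (assumption).
        # Returns 0 to vote for `first`, 1 to vote for `second`.
        return 0 if len(first) >= len(second) else 1


def run_council(members):
    """Return member names ranked by pooled pairwise votes (most votes first)."""
    # 1) Formulate the test set through equal participation.
    prompts = [m.write_prompt() for m in members]

    # 2) Administer the test: every member answers every prompt.
    responses = {m.name: {p: m.respond(p) for p in prompts} for m in members}

    # 3) Evaluate as a collective jury: every member votes on every pair.
    votes = Counter({m.name: 0 for m in members})
    for prompt in prompts:
        for x, y in combinations(members, 2):
            for juror in members:
                pick = juror.judge(
                    prompt, responses[x.name][prompt], responses[y.name][prompt]
                )
                votes[(x, y)[pick].name] += 1

    return [name for name, _ in votes.most_common()]


if __name__ == "__main__":
    council = [StubMember("a", 1), StubMember("b", 3), StubMember("c", 2)]
    print(run_council(council))
```

Pooling votes from every member, rather than trusting one judge, is the part of the design that this sketch tries to capture; the aggregation rule itself would likely differ in a real deployment.
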
Roadmap:
- Use the Council to benchmark evaluative characteristics of LLM-as-a-Judge/Jury like bias, affinity, and agreement.
- Expand to more domains, use cases, and sophisticated agentic interactions.
- Produce a generalized user interface for Council-as-a-Service.