---
title: README
emoji: πŸ‘€
colorFrom: blue
colorTo: pink
sdk: static
pinned: false
---

## Consensus-driven LLM Evaluation

The rapid advancement of Large Language Models (LLMs) necessitates robust
and challenging benchmarks. 

To rank LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness,
the **Language Model Council (LMC)** follows a democratic process: 1) formulate a test set through
equal participation, 2) administer the test among council members, and 3) evaluate
responses as a collective jury.
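
The sketch below illustrates how such a consensus loop could be structured. It is a minimal illustration with placeholder model names and hypothetical `respond`/`judge` functions, not the LMC implementation itself.

```python
# Illustrative sketch only: placeholder names, not the LMC codebase.
from collections import defaultdict
from statistics import mean

# 1) Each council member contributes prompts with equal weight.
council = ["model_a", "model_b", "model_c"]
test_set = [f"interpersonal dilemma proposed by {m}" for m in council]

def respond(model, prompt):
    # Placeholder for an actual LLM completion call.
    return f"{model}'s response to '{prompt}'"

def judge(judge_model, prompt, response):
    # Placeholder for an LLM-as-a-judge call returning a score in [1, 10].
    return 5.0

# 2) Administer the test: every member answers every prompt.
responses = {(m, p): respond(m, p) for m in council for p in test_set}

# 3) Evaluate as a collective jury: average scores across all judges
#    (the real protocol may exclude self-judging or use pairwise votes).
scores = defaultdict(list)
for judge_model in council:
    for (respondent, prompt), response in responses.items():
        scores[respondent].append(judge(judge_model, prompt, response))

leaderboard = sorted(council, key=lambda m: mean(scores[m]), reverse=True)
print(leaderboard)
```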

Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, more robust,
and less biased than those from any individual LLM judge, and that are more consistent with a human-established leaderboard than other benchmarks such as Chatbot Arena or MMLU.

Roadmap:

- Use the Council to benchmark evaluative characteristics of LLM-as-a-Judge/Jury like bias, affinity, and agreement.
- Expand to more domains, use cases, and sophisticated agentic interactions.
- Produce a generalized user interface for Council-as-a-Service.