Language Model Council

Consensus-driven LLM Evaluation

The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks.

To address the challenge of ranking LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness, the Language Model Council (LMC) follows a democratic process: 1) formulate a test set through equal participation, 2) administer the test to all council members, and 3) evaluate responses as a collective jury.
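
For intuition, this process can be sketched as a short loop over council members. The sketch below is illustrative only: the member interface (`generate_dilemma`, `respond`, `prefers_first`) and the simple majority-vote, pairwise-win aggregation are assumptions made for the example, not the actual LMC implementation.

```python
from collections import defaultdict
from itertools import combinations


def run_council(members):
    """Rank council members via the three council phases (illustrative sketch)."""
    # 1) Formulate the test set: every member contributes one prompt (equal participation).
    test_set = [m.generate_dilemma() for m in members]

    # 2) Administer the test: every member responds to every prompt.
    responses = {
        (m.name, i): m.respond(prompt)
        for i, prompt in enumerate(test_set)
        for m in members
    }

    # 3) Evaluate as a collective jury: every member judges every response pair,
    #    and pairwise wins are tallied into a single council-level ranking.
    wins = defaultdict(int)
    for i, prompt in enumerate(test_set):
        for a, b in combinations(members, 2):
            votes_for_a = sum(
                judge.prefers_first(prompt, responses[(a.name, i)], responses[(b.name, i)])
                for judge in members
            )
            winner = a if votes_for_a > len(members) / 2 else b
            wins[winner.name] += 1

    # Higher pairwise win count = higher rank on the council leaderboard.
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)
```

The majority vote in step 3 is just one possible aggregation rule; mean judge ratings or a Bradley-Terry fit over pairwise outcomes could be substituted without changing the overall structure.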

Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, more robust, and less biased than those from any individual LLM judge, and that it agrees more closely with a human-established leaderboard than other benchmarks such as Chatbot Arena or MMLU.

Roadmap:

  • Use the Council to benchmark evaluative characteristics of LLM-as-a-Judge/Jury like bias, affinity, and agreement.
  • Expand to more domains, use cases, and sophisticated agentic interactions.
  • Produce a generalized user interface for Council-as-a-Service.
