---
title: README
emoji: π
colorFrom: blue
colorTo: pink
sdk: static
pinned: false
---
## Consensus-driven LLM Evaluation
The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. To address the challenge of ranking LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness, the **Language Model Council (LMC)** operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
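
The three stages can be pictured as nested loops over the council members. Below is a minimal Python sketch of that idea; the `query` helper, the member names, and the 1-to-10 scoring prompt are illustrative placeholders, not the project's actual API.

```python
# Minimal sketch of the three council stages. `query` is a hypothetical
# placeholder for an LLM API call; swap in a real client in practice.
import random
from statistics import mean

council = ["model_a", "model_b", "model_c"]  # placeholder member names

def query(model: str, prompt: str) -> str:
    # Placeholder: a real implementation would call the model's API here.
    return f"[{model} answer to: {prompt[:30]}...]"

# 1) Formulate the test set: every member contributes prompts equally.
test_set = [query(m, "Propose one interpersonal dilemma.") for m in council]

# 2) Administer the test: every member answers every prompt.
responses = {m: [query(m, dilemma) for dilemma in test_set] for m in council}

# 3) Evaluate as a collective jury: every member scores every response,
#    and scores are averaged so no single judge dominates the ranking.
def judge(judge_model: str, dilemma: str, answer: str) -> float:
    _ = query(judge_model, f"Rate this answer to '{dilemma}' from 1 to 10:\n{answer}")
    return random.uniform(1, 10)  # placeholder: parse the judge's reply in practice

leaderboard = {
    member: mean(
        judge(j, dilemma, answer)
        for j in council
        for dilemma, answer in zip(test_set, responses[member])
    )
    for member in council
}
print(sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True))
```

Averaging scores across every juror is what keeps any single judge's bias from dominating the final ranking.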
Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, more robust, and less biased than those from any individual LLM judge, and that its leaderboard is more consistent with a human-established ranking than other benchmarks such as Chatbot Arena or MMLU.

Roadmap:
- Use the Council to benchmark evaluative characteristics of LLM-as-a-Judge/Jury like bias, affinity, and agreement. | |
- Expand to more domains, use cases, and sophisticated agentic interactions. | |
- Produce a generalized user interface for Council-as-a-Service. | |