m-ric
posted an update Nov 21
๐—ก๐—ฒ๐˜„ ๐—น๐—ฒ๐—ฎ๐—ฑ๐—ฒ๐—ฟ๐—ฏ๐—ผ๐—ฎ๐—ฟ๐—ฑ ๐—ฟ๐—ฎ๐—ป๐—ธ๐˜€ ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ณ๐—ผ๐—ฟ ๐—Ÿ๐—Ÿ๐— -๐—ฎ๐˜€-๐—ฎ-๐—ท๐˜‚๐—ฑ๐—ด๐—ฒ: ๐—Ÿ๐—น๐—ฎ๐—บ๐—ฎ-๐Ÿฏ.๐Ÿญ-๐Ÿณ๐Ÿฌ๐—• ๐˜๐—ผ๐—ฝ๐˜€ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฎ๐—ป๐—ธ๐—ถ๐—ป๐—ด๐˜€! ๐Ÿง‘โ€โš–๏ธ

Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.

First, what is "LLM-as-a-judge"?
👉 It's a very useful technique for evaluating LLM outputs. If what you're evaluating cannot be properly scored with deterministic criteria, like the "politeness" of an LLM output or how faithful it is to an original source, you can use an LLM judge instead: prompt another LLM with "Here's an LLM output, please rate this on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
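
To make the "prompt, then parse the number" step concrete, here is a minimal sketch. The names `JUDGE_PROMPT`, `parse_score`, `judge`, and the `call_llm` callable are hypothetical placeholders for whatever client you use, and the extra "Answer with a single number" instruction is my own addition to make parsing easier, not part of the prompt quoted above.

```python
import re
from typing import Callable, Optional

# Prompt template based on the wording in the post; the final sentence
# ("Answer with a single number.") is an assumption added to simplify parsing.
JUDGE_PROMPT = (
    "Here's an LLM output, please rate this on criterion {criterion} "
    "on a scale of 1 to 5. Answer with a single number.\n\n"
    "Output to rate:\n{output}"
)


def parse_score(judge_reply: str) -> Optional[int]:
    """Return the first standalone digit between 1 and 5, or None if absent.

    Naive on purpose: if the judge restates the scale ("on a scale of 1 to 5")
    before answering, this would pick up the wrong number.
    """
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(output: str, criterion: str, call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask a judge model to rate `output`; `call_llm` is a hypothetical
    function that sends a prompt to your LLM and returns its text reply."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, output=output)
    return parse_score(call_llm(prompt))
```

Passing the model call in as a function keeps the sketch independent of any particular API. In practice you would probably also want to handle the `None` case, for example by retrying or discarding the sample.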

๐Ÿง But who judges the judge?
How can you make sure your LLM-judge is reliable? You can have a specific dataset annotated with scores provided by human judges, and compare how LLM-judge scores correlate with human judge scores.
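
As a rough illustration of that comparison, the sketch below computes a Pearson correlation between human scores and LLM-judge scores for the same samples. Pearson is just one possible choice here (a rank correlation may suit 1-to-5 ratings better), and the function name `pearson` is my own.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A judge (or annotator) that always gives the same score
        # has no defined correlation.
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

You would call it as `pearson(human_scores, judge_scores)`, where both lists hold the ratings for the same samples in the same order. A value close to 1 suggests the judge ranks outputs much like your human annotators do.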

📊 Before even running that benchmark, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!

And the outcome is surprising: models come in quite a different order from what we're used to in general rankings. Probably some have much better bias mitigation than others!

Take a deeper look here 👉 https://huggingface.co/blog/arena-atla