New leaderboard ranks LLMs for LLM-as-a-judge: Llama-3.1-70B tops the rankings! 🧑‍⚖️
Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.
First, what is "LLM-as-a-judge"?
💡 It's a very useful technique for evaluating LLM outputs. When what you're evaluating can't be scored with deterministic criteria, like the "politeness" of an LLM output or how faithful it is to an original source, you can use an LLM judge instead: prompt another LLM with "Here's an LLM output, please rate it on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
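For illustration, here's a minimal sketch of that idea. It assumes an OpenAI-compatible chat client; the model name and the "politeness" criterion are just placeholders, not recommendations.

```python
# Minimal LLM-as-a-judge sketch (assumes an OpenAI-compatible endpoint;
# model name and criterion are placeholder assumptions).
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Here's an LLM output, please rate it on the criterion "
    "'{criterion}' on a scale of 1 to 5. Answer with the number only.\n\n"
    "Output to rate:\n{output}"
)

def judge_score(output: str, criterion: str = "politeness", model: str = "gpt-4o-mini") -> int:
    """Ask a judge LLM for a 1-5 rating and parse the number from its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(criterion=criterion, output=output)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    reply = response.choices[0].message.content
    match = re.search(r"[1-5]", reply)  # grab the first in-range digit
    if match is None:
        raise ValueError(f"Could not parse a 1-5 score from: {reply!r}")
    return int(match.group())

# Example: judge_score("Thanks so much for your patience!", criterion="politeness")
```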
🧐 But who judges the judge?
How can you make sure your LLM judge is reliable? One way is to build a dataset annotated with scores from human judges, then measure how well the LLM judge's scores correlate with the human scores.
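A tiny sketch of that check, using made-up placeholder scores (in practice you'd load your own annotated dataset and score it with your judge):

```python
# "Judging the judge": compare LLM-judge scores against human annotations
# on the same examples. The lists below are placeholder data.
from scipy.stats import spearmanr

human_scores = [5, 2, 4, 1, 3, 4, 2, 5]      # ratings from human annotators
llm_judge_scores = [4, 2, 5, 1, 3, 3, 2, 5]  # ratings from the LLM judge on the same outputs

correlation, p_value = spearmanr(human_scores, llm_judge_scores)
print(f"Spearman correlation with human judges: {correlation:.2f} (p={p_value:.3f})")
# High correlation suggests the LLM judge is a reasonable proxy for human ratings
# on this criterion; low correlation means you shouldn't trust it yet.
```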
🚀 But before you even run that benchmark yourself, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!
And the outcome is surprising: models rank in quite a different order from what we're used to in general leaderboards, probably because some mitigate judging biases much better than others!
Take a deeper look here 👉 https://huggingface.co/blog/arena-atla