Add MT-Bench, with Mixtral-8x7b judge?

#558
by andysalerno - opened

I'm aware of #459 and just want to share some thoughts...

I think it would be interesting to explore using Mixtral-8x7b (which you would likely agree is the most powerful open model) as judge on the MT-Bench question set, and including that score in the leaderboard.

Some reasons why MT-Bench would be a good addition:

  • MT-Bench corresponds well to actual chat scenarios (anecdotal but intuitive)
  • MT-Bench uses the model's own chat template. Chat models that depend strongly on their prompt formats tend to be punished on the leaderboard, because they perform suboptimally when emitting free text outside the bounds of those formats (which is what the current benchmarks require). This encourages people to train models that are less dependent on prompt templates, but therefore also weaker as chat agents (OK, I don't have data to prove that "less dependent on the template" means "weaker as a chat agent", but I firmly believe it anyway :D). A sketch of what "using the chat template" means in practice follows this list.
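
To make that second point concrete, here is a minimal sketch of what "using the chat prompt of the model" looks like with `transformers`. The model id and the example question are just placeholders for illustration:

```python
from transformers import AutoTokenizer

# Placeholder model choice; any chat-tuned model that ships a chat template
# in its tokenizer config works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Compose a short travel blog post about Hawaii."},
]

# apply_chat_template wraps the conversation in the model's own format
# (e.g. [INST] ... [/INST] for Mistral-family models) instead of feeding it
# raw free text, which is what the current leaderboard benchmarks do.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```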

Some reasons why Mixtral-8x7b would be a good judge (IMO):

  • It is self-hostable -- well, hostable by Hugging Face at least :)
  • It is completely open, so it is not subject to silent behind-the-curtain changes the way GPT-4 is: with fixed weights (and deterministic decoding settings), it will give the same responses tomorrow as it does today.
  • It may not be as powerful as GPT-4, and therefore may not be as good a judge, but it seems reasonable that over the 80-question MT-Bench set it can still extract a signal for roughly how well one model performs relative to another (a rough judging loop is sketched after this list).
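
As a rough sketch of that judging loop (not any existing harness): the model id, the assumption that an inference endpoint is serving it, the judge prompt wording, and the `judge` helper are all illustrative; the `Rating: [[5]]` parsing convention follows the style of FastChat's llm_judge tooling.

```python
import re
from huggingface_hub import InferenceClient

# Assumption: an endpoint serving this model is available to the evaluator.
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Single-answer grading prompt in the MT-Bench style: the judge explains
# its reasoning, then emits a rating wrapped in [[ ]].
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the "
    "response provided by an AI assistant to the user question below. "
    "Be as objective as possible. After your explanation, rate the "
    "response on a scale of 1 to 10 in the exact format: Rating: [[5]].\n\n"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}\n"
)

def judge(question: str, answer: str) -> float | None:
    """Ask the judge model for a 1-10 score and parse it out."""
    out = client.chat_completion(
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
        max_tokens=512,
    )
    text = out.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None

# Averaging judge() over the 80 MT-Bench questions (both turns) would give
# the per-model score proposed above.
```
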
Open LLM Leaderboard org

Hi!
It's unlikely we will add a model-as-judge evaluation to the Open LLM Leaderboard (because of the compute costs), but if someone wants to set this up in a dedicated leaderboard, I'd be happy to discuss it!

clefourrier changed discussion status to closed
