Add MT-Bench, with Mixtral-8x7b judge?

#558
by andysalerno - opened

I'm aware of #459 and just want to share some thoughts...

I think it would be interesting to explore using Mixtral-8x7b (which you would likely agree is the most powerful open model) as judge on the MT-Bench question set, and including that score in the leaderboard.

Some reasons why MT-Bench would be a good addition:

  • MT-Bench corresponds well to actual chat scenarios (anecdotal but intuitive)
  • MT-Bench uses the model's own chat template. Chat models that depend strongly on their prompt formats tend to be punished on the leaderboard, because they perform suboptimally when emitting free text outside the bounds of those formats (which is what the current benchmarks require). This encourages people to train models that are less dependent on prompt templates, but therefore also weaker as chat agents (OK, I don't have data to prove that "less dependent on the template" means "weaker as a chat agent", but I firmly believe it anyway :D). A sketch of what "using the chat template" means in practice follows this list.
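
To make that second point concrete, here is a minimal sketch of what "using the chat prompt of the model" looks like with `transformers`. The model id and the example question are just placeholders for illustration:

```python
from transformers import AutoTokenizer

# Placeholder model choice; any chat-tuned model that ships a chat template
# in its tokenizer config works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Compose a short travel blog post about Hawaii."},
]

# apply_chat_template wraps the conversation in the model's own format
# (e.g. [INST] ... [/INST] for Mistral-family models) instead of feeding it
# raw free text, which is what the current leaderboard benchmarks do.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```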

Some reasons why Mixtral-8x7b would be a good judge (IMO):

  • It is self-hostable -- well, hostable by Hugging Face at least :)
  • It is completely open, so it is not subject to silent behind-the-curtain changes the way GPT-4 is: with fixed weights (and deterministic decoding settings), it will give the same responses tomorrow as it does today.
  • It may not be as powerful as GPT-4, and therefore may not be as good a judge, but it seems reasonable that over the 80-question MT-Bench set it can still extract a signal for roughly how well one model performs relative to another (a rough judging loop is sketched after this list).
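
As a rough sketch of that judging loop (not any existing harness): the model id, the assumption that an inference endpoint is serving it, the judge prompt wording, and the `judge` helper are all illustrative; the `Rating: [[5]]` parsing convention follows the style of FastChat's llm_judge tooling.

```python
import re
from huggingface_hub import InferenceClient

# Assumption: an endpoint serving this model is available to the evaluator.
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Single-answer grading prompt in the MT-Bench style: the judge explains
# its reasoning, then emits a rating wrapped in [[ ]].
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the "
    "response provided by an AI assistant to the user question below. "
    "Be as objective as possible. After your explanation, rate the "
    "response on a scale of 1 to 10 in the exact format: Rating: [[5]].\n\n"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}\n"
)

def judge(question: str, answer: str) -> float | None:
    """Ask the judge model for a 1-10 score and parse it out."""
    out = client.chat_completion(
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
        max_tokens=512,
    )
    text = out.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None

# Averaging judge() over the 80 MT-Bench questions (both turns) would give
# the per-model score proposed above.
```
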
Open LLM Leaderboard org

Hi!
It's unlikely we will add a model-as-judge evaluation to the Open LLM Leaderboard (because of the compute costs), but if someone wants to set this up in a dedicated leaderboard, I'd be happy to discuss it!

clefourrier changed discussion status to closed
