Are there MT-Bench metrics produced for this model?

#4
by Jaredc - opened

Looking to see how this compares against e.g. Openchat-3.5, GPT-3.5, etc.

No, unfortunately, I would like to see it too.

I might run Alpaca v2 on this later when I have time
(but knowing Maxime he'll have 5 other models by that time).

MT-bench is a little pricier to run.
(and I don't have it set up right now)

FYI, I've run MT-Bench (from HEAD checkout as of 2024-01-28). Here are the turn 1 and turn 2 scores, with neighboring models for context:

model                       turn  score
gpt-3.5-turbo               1     8.07500
neuralbeagle14-7b           1     7.93125
claude-instant-v1           1     7.80000

claude-v1                   2     7.650000
neuralbeagle14-7b           2     7.325000
nous-hermes-2-solar-10.7b   2     6.950000
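
For a rough single-number comparison, the two turn scores can be averaged. This is only a sketch: it assumes both turns have the same number of judged answers (so the overall score is the plain mean of the turn averages), which the numbers above don't confirm.

```python
# Turn-level MT-Bench scores for neuralbeagle14-7b, copied from the table above.
turn_scores = {1: 7.93125, 2: 7.325}

# Assumption: each turn contributes the same number of judged answers,
# so the overall score is the unweighted mean of the two turn averages.
overall = sum(turn_scores.values()) / len(turn_scores)

print(f"neuralbeagle14-7b overall (approx.): {overall:.4f}")
```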

Pretty impressive MT-Bench scores!

Really cool, thanks @leonardlin !

NP. Btw, possibly also of interest: I was just doing a category-based analysis of some models. It's about as expected (weakest in reasoning, math, and code). Surprisingly, gpt-3.5-turbo does about as badly on reasoning, which makes me wonder how small gpt-3.5-turbo really is, as I generally find reasoning to be the hardest thing for smaller models to distill.

newplot(9).png

@mlabonne if you want to improve MT-Bench you should look into @SanjiWatsuki's models.
He uses some interesting bases for merging great conversationalists.

https://huggingface.co/SanjiWatsuki/Sonya-7B
https://huggingface.co/SanjiWatsuki/Kunoichi-7B
https://huggingface.co/SanjiWatsuki/Kunoichi-DPO-v2-7B (this has no info just results of what's possible with further DPO)

Screenshot 2024-01-31 at 18.48.58.png

Discussions and reviews:

https://www.reddit.com/r/LocalLLaMA/comments/19e89dn/what_is_the_best_model_to_write_stories_that/
https://huggingface.co/SanjiWatsuki/Kunoichi-7B/discussions/1

Yes, they have one NSFW model in the merge, and the Reddit thread is about NSFW too, but if you read the reviews, they talk a lot about the models being smart and following instructions well.
(Also, the NSFW model can be left out of the merge; there are others mentioned.)

Which makes sense because MT-bench checks conversational skill.

Screenshot 2024-01-31 at 20.04.02.png

Screenshot 2024-01-31 at 20.04.29.png

Screenshot 2024-01-31 at 20.04.45.png

Kunoichi-DPO-v2 is just the mlabonne Orca DPO Pairs treatment but with more steps and a lower LR. Sonya-7B is somewhat of a meme merge that I made just to beat xDAN on MT-Bench. Kunoichi-DPO-v2 is a real attempt at a usable model.

Thanks for the additional sources @gblazex !

@SanjiWatsuki Did you find that more steps and a lower LR give better performance? It makes sense because the high LR is probably the reason why the loss quickly drops to zero and we stop learning.

What are your thoughts on MT-Bench? I tried a few models from gblazex's table and I don't find these results super intuitive. For example, NeuralBeagle14 is pretty bad at multi-turn conversations, but GPT-3.5-Turbo (and maybe claude-v1) look underrated.

I did find that lowering the LR and running it longer helped. It certainly boosted the MT-Bench score a lot compared with the original attempt :)

I think MT-Bench is a good benchmark but it has similar faults to AlpacaEval. Because it uses GPT-4 as the judge, it has some of the same weaknesses, like overly rewarding verbose answers. If I had to focus on a single number, I'd probably pick MT-Bench, but I know you can release a stinker of a model that scores well on MT-Bench.

What I like about MT-Bench is that it can properly punish failure-to-generate-EOS issues, where the answers become winding, wildly off topic, and confusing. A lot of other benchmarks fail to properly penalize this behavior.

I use MT-Bench as just one data point in a mix of other benchmarks but, given its extremely strong correlation with ChatBot Arena Elo, I focus on it more than a lot of other numbers.

Samual was gracious enough to run 9 models overnight,
so we have a lot of data for EQ-bench Version 2 too

Spearman Correlations:
EQ-bench v2: 0.863
MT-bench: 0.891
Alpaca v2: 0.899

Kendall's Tau:
EQ-bench v2: 0.730
MT-bench: 0.759
Alpaca v2: 0.759

(I only checked overlapping rows where models have results for all 3 benchmarks)
https://github.com/EQ-bench/EQ-Bench/issues/4
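
For anyone who wants to reproduce this kind of comparison, here is a minimal pure-Python sketch of Spearman's rho and Kendall's tau-b over overlapping rows only. The model names and scores are hypothetical placeholders, not real results, and using Chatbot Arena Elo as the reference column is an assumption (the numbers above don't say what they are correlated against).

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    conc = disc = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = ((conc + disc + only_x_tied) * (conc + disc + only_y_tied)) ** 0.5
    return (conc - disc) / denom

# Hypothetical placeholder scores, NOT real results. None = benchmark not run.
rows = {
    "model-a": {"arena_elo": 1250, "mt_bench": 9.0, "eq_bench": 80.0},
    "model-b": {"arena_elo": 1120, "mt_bench": 8.1, "eq_bench": 72.0},
    "model-c": {"arena_elo": 1100, "mt_bench": 8.3, "eq_bench": 65.0},
    "model-d": {"arena_elo": 1050, "mt_bench": 7.2, "eq_bench": None},
    "model-e": {"arena_elo": 1000, "mt_bench": 6.5, "eq_bench": 50.0},
}
columns = ["arena_elo", "mt_bench", "eq_bench"]

# Keep only models that have a score in every column (the overlapping rows).
overlap = [r for r in rows.values() if all(r[c] is not None for c in columns)]
elo = [r["arena_elo"] for r in overlap]

results = {}
for bench in ("mt_bench", "eq_bench"):
    scores = [r[bench] for r in overlap]
    results[bench] = (spearman(elo, scores), kendall_tau_b(elo, scores))
    print(f"{bench}: spearman={results[bench][0]:.3f} tau={results[bench][1]:.3f}")
```

Dropping non-overlapping rows keeps the benchmarks comparable on the same set of models, but with this few rows the coefficients are noisy, so small differences between benchmarks shouldn't be over-read.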

EQ-bench doesn't have the length bias. And it ranks NeuralBeagle high, which I like because it's seen a lot of models through merging.

There won't be 1 benchmark that tells the whole story, but it's already visible that it's not enough to do well just on MT,
also not enough to do well just on EQ, ideally you want both.

That's what NeuralBeagle is showing, and I think Sanji's models would fare similarly too.

It makes sense to run your models through EQ-bench v2 @SanjiWatsuki , I believe your old results were from v1.
