Those benchmark scores look insane...

#2
by mirek190 - opened

look

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

I'm curious how the benchmarks look for (programming) MT-Bench, CoT, HumanEval+, LM-Eval, etc. How can I run those or find the results?

Benchmark scores can be misleading, so take them all with a grain of salt.

I haven't tested against those benchmarks; it takes a lot of time and resources to run some of them. I may try a few, but on some (AlpacaEval, for example) this model performs worse than others because it is uncensored and answers "bad" questions.
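For anyone who does want to try one of these locally, here is a minimal sketch using EleutherAI's lm-evaluation-harness (this assumes its 0.4.x `simple_evaluate` API; the model name and settings below are placeholders, not a recommendation):

```python
# Minimal sketch: run one Open LLM Leaderboard-style task locally with
# lm-evaluation-harness (assumes version 0.4.x; "your-org/your-model" is a placeholder).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                              # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",
    tasks=["arc_challenge"],                 # one of the leaderboard tasks
    num_fewshot=25,                          # the leaderboard uses 25-shot for ARC
    batch_size=4,
)

# Per-task metrics (accuracy, normalized accuracy, etc.)
print(results["results"]["arc_challenge"])
```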


Any chance you could just run MT-Bench? Since it's a writing/ERP-focused model as opposed to a coding-focused one, it'd be one of the better benchmarks to run (if you have the time and resources, of course). Thanks again for your work.

I'll take a look! This model also has quite a few coding instructions, so it may actually do fairly well. The focus is actually much heavier on coding and reasoning than on creative tasks/RP.

Man, this is the second time I'm writing to you. You and @TheBloke are my heroes (and everyone's); thanks a lot for all the effort you put in.
Congratulations on the top spot. You are a one-man army and I wish you all the best.
Respect!

These models are great, but how well do they compare to llama 2 chat models in multi-turn conversations? Is there a benchmark for this?

Not sure what happened, but the scores dropped on the leaderboard :(

It had some contamination so I purged and rebuilt.
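(For context, the usual way to detect this kind of contamination is an n-gram overlap check between the training data and the benchmark test sets. The sketch below is purely illustrative, with hypothetical helper names; it is not necessarily the procedure that was used here.)

```python
# Illustrative n-gram overlap contamination check (hypothetical helpers,
# not the exact procedure used for this model).

def ngrams(text: str, n: int = 13) -> set[str]:
    """Return the set of whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def find_contaminated(train_examples: list[str], benchmark_examples: list[str]) -> list[int]:
    """Indices of training examples sharing any n-gram with the benchmark test set."""
    bench_ngrams: set[str] = set()
    for text in benchmark_examples:
        bench_ngrams |= ngrams(text)
    return [
        i for i, text in enumerate(train_examples)
        if not ngrams(text).isdisjoint(bench_ngrams)
    ]
```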


These models are great, but how well do they compare to llama 2 chat models in multi-turn conversations? Is there a benchmark for this?

I'm not aware of a benchmark for this purpose on 70b models. I know there are some benchmarks others have done for RP but they tend to stop at 34b.

jondurbin changed discussion status to closed
