@BramVanroy on Hugging Face: " 🎈 LLM Benchmarks Update! **tl;dr: do not depend on benchmark leaderboards…"

BramVanroy

posted an update Mar 28

🎈 LLM Benchmarks Update!

**tl;dr: do not depend on benchmark leaderboards to choose your "chatbot" model! (Especially for non-English languages.)**

First of all, I'm discontinuing the Open #Dutch #LLM Leaderboard (https://lnkd.in/eFnsaFR6). It will stay online for now, but I urge the use of the ScandEval leaderboard instead (https://scandeval.com/dutch-nlg/) by @saattrupdan . It contains more tasks, has better reproducibility and statistics (confidence intervals), and comes with a flexible back-end library (scandeval) for running your own benchmarks. As part of project "Leesplank" (with Michiel Buisman and Maarten Lens-FitzGerald) we recently added GPT-4-1106-preview scores to give the leaderboard a good "target".

An important note here is that benchmark leaderboards are not the absolute truth. Evaluating generative models is especially hard. You run into issues like prompt engineering (and the sensitivity of models to one prompt or another), structured output generation, and - quite simply - "how to automatically evaluate open-ended generation".

💡 Another important but under-discussed facet is the discrepancy between a model's ability to understand and its ability to generate text *in different languages* (so the NLU part of NLG benchmarking). In other words: some of the listed models score really well on, e.g., multiple-choice question (MCQ) benchmarks but are not suitable for use as DUTCH chatbots. Interestingly, some of these models seem to understand questions in Dutch and are able to pick the right answer (because they have good knowledge or reasoning skills), but generating fluent and grammatical Dutch is something else entirely! This is perhaps also true for humans: it's easier to sort-of grasp the meaning of a sentence in a new language and answer with "Yes" or "No", but answering fluently in that language is much harder! Yet, your language production fluency does not necessarily say anything about your knowledge and reasoning skills.

Hopefully we can get a chat arena for Dutch some day - user feedback is the most powerful metric!

robinsmits

Mar 29

@BramVanroy The ScandEval Leaderboard and package look amazing. How can I request a model to be added and benchmarked? Should I reach out to one of the contributors?

BramVanroy

Mar 29

I think the correct place for this is to make a new issue on their issue tracker: https://github.com/ScandEval/ScandEval/issues

cnmoro

Mar 30

I have noticed exactly the same thing for Portuguese.
It appears that all the LLMs I try understand the prompt perfectly but fail miserably at generating coherent answers free of grammatical errors.
I have also noticed that, since most models are trained on English, adding "\nAnswer in {language}" to the end of the prompt gives much better results.
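
A minimal sketch of that prompt tweak in Python, for anyone who wants to try it. The helper name `with_language_hint` and the example question are made up for illustration; only the appended "\nAnswer in {language}" suffix comes from the comment above, and how much it helps will depend on the model.

```python
def with_language_hint(prompt: str, language: str) -> str:
    """Append an explicit output-language instruction to a prompt.

    Hypothetical helper: it only builds the prompt string and does not
    call any model or library.
    """
    # Strip trailing whitespace so the hint always starts on its own line.
    return f"{prompt.rstrip()}\nAnswer in {language}"


if __name__ == "__main__":
    question = "Qual é a capital de Portugal?"
    # The resulting string can be sent to whatever chat model you are testing.
    print(with_language_hint(question, "Portuguese"))
```

The hint is placed at the very end of the prompt, as suggested, so it is the last instruction the model reads before generating.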
