Clinical & Biomedical ML Leaderboards

community

Activity Feed Request to join this org

AI & ML interests

None defined yet.

Recent Activity

clefourrier authored a paper 16 days ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

shanchen authored a paper about 1 month ago

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

daniellebitt authored a paper about 1 month ago

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

View all activity

med-llm-leaderboard's activity

clefourrier

authored a paper 16 days ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published 18 days ago • 17

shanchen

authored a paper about 1 month ago

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Paper • 2411.06469 • Published Nov 10 • 17

daniellebitt

authored a paper about 1 month ago

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Paper • 2411.06469 • Published Nov 10 • 17

shanchen

authored 2 papers 2 months ago

Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation

Paper • 2409.20385 • Published Sep 30

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

Paper • 2410.12722 • Published Oct 16 • 5

clefourrier

authored 2 papers 6 months ago

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

Paper • 2404.05904 • Published Apr 8 • 8

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 185

daniellebitt

authored a paper 6 months ago

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Paper • 2406.12066 • Published Jun 17 • 8

gallifantjack

authored a paper 6 months ago

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Paper • 2406.12066 • Published Jun 17 • 8

NikolajMunch

authored 2 papers 6 months ago

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Paper • 2405.05506 • Published May 9 • 1

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Paper • 2406.12066 • Published Jun 17 • 8

shanchen

authored 3 papers 6 months ago

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Paper • 2406.12066 • Published Jun 17 • 8

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Paper • 2405.05506 • Published May 9 • 1

Measuring Pointwise $\mathcal{V}$-Usable Information In-Context-ly

Paper • 2310.12300 • Published Oct 18, 2023 • 1

clefourrier

posted an update 8 months ago

Post

5438

In a basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm

clefourrier

posted an update 8 months ago

Post

4418

Contamination free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only on new problems out of the training data. This means... contamination free code evals! 🚀

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!

clefourrier

posted an update 8 months ago

Post

2209

🆕 Evaluate your RL agents - who's best at Atari?🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard

clefourrier

posted an update 9 months ago

Post

2216

Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from Prompt question? to Question: prompt question?\nChoices: enumeration of all choices\nAnswer: ), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format on the x axis, all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...

clefourrier

posted an update 9 months ago

Post

2352

Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️the order in which the few shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, during 2 runs. The *only difference* between the first and second run being that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.

As you can see on the attached picture, you get a difference of up to 3 points between the 2 few-shot samples shuffling.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or coms).
-> This is why we need reproducible evaluation in a fair and exactly similar setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.

4 replies

clefourrier

posted an update 9 months ago

Post

2012

Are you looking for the perfect leaderboard/arena for your use case? 👀

There's a new tool for this!
https://huggingface.co/spaces/leaderboards/LeaderboardFinder

Select your modality, language, task... then search! 🔍
Some categories of interest:
- does the leaderboard accept submissions?
- is the test set private or public?
- is it using an automatic metric, human evaluators, or llm as a judge?

The spaces list is build from space metadata, and reloaded every hour.

Enjoy!

AI & ML interests

Recent Activity

Team members 8

med-llm-leaderboard's activity