1061 40 168

Clémentine Fourrier

clefourrier

http://clefourrier.github.io

AI & ML interests

None yet

Recent Activity

updated a dataset 2 days ago

gaia-benchmark/results_public

new activity 5 days ago

open-llm-leaderboard/open_llm_leaderboard:Proposal for new column

liked a model 5 days ago

utter-project/EuroLLM-1.7B

View all activity

Articles

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Jun 18

• 43

Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5000B tokens tokens and 11 languages

May 24

• 25

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

May 24

• 21

Let's talk about LLM evaluation

May 23

• 140

Introducing the Open Arabic LLM Leaderboard

May 14

• 76

Introducing the Open Leaderboard for Hebrew LLMs!

May 5

• 32

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

May 3

• 13

Improving Prompt Consistency with Structured Generations

Apr 30

• 58

Introducing the Open Chain of Thought Leaderboard

Apr 23

• 27

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Apr 19

• 126

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Apr 16

• 14

Introducing the Chatbot Guardrails Arena

Mar 21

• 4

Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?

Mar 5

• 4

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Feb 27

• 41

Introducing the Red-Teaming Resistance Leaderboard

Feb 23

• 13

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Feb 20

• 3

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Feb 2

• 3

Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases

Jan 31

• 3

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

Jan 29

• 17

A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard

Jan 12

• 6

2023, year of open LLMs

Dec 18, 2023

• 5

Open LLM Leaderboard: DROP deep dive

Dec 1, 2023

• 5

Overview of natively supported quantization schemes in 🤗 Transformers

Sep 12, 2023

• 11

What's going on with the Open LLM Leaderboard?

Jun 23, 2023

• 23

Introduction to Graph Machine Learning

Jan 3, 2023

• 19

Organizations

clefourrier's activity

reacted to fdaudens's post with ❤️ 20 days ago

Post

1732

Keeping up with open-source AI in 2024 = overwhelming.

Here's help: We're launching our Year in Review on what actually matters, starting today!

Fresh content dropping daily until year end. Come along for the ride - first piece out now with @clem 's predictions for 2025.

Think of it as your end-of-year AI chocolate calendar.

Kudos to @BrigitteTousi @clefourrier @Wauplin @thomwolf for making it happen. We teamed up with aiworld.eu for awesome visualizations to make this digestible—it's a charm to work with their team.

Check it out: huggingface/open-source-ai-year-in-review-2024

reacted to thomwolf's post with 🧠🔥 28 days ago

Post

1633

Interesting long read from @evanmiller-anthropic on having a better founded statistical approach to Language Model Evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

reacted to malhajar's post with 🔥 about 1 month ago

Post

4204

🇫🇷 Lancement officiel de l'OpenLLM French Leaderboard : initiative open-source pour référencer l’évaluation des LLMs francophones

Après beaucoup d’efforts et de sueurs avec Alexandre Lavallee, nous sommes ravis d’annoncer que le OpenLLMFrenchLeaderboard est en ligne sur Hugging Face (space url: le-leadboard/OpenLLMFrenchLeaderboard) la toute première plateforme dédiée à l’évaluation des grands modèles de langage (LLM) en français. 🇫🇷✨

Ce projet de longue haleine est avant tout une œuvre de passion mais surtout une nécessité absolue. Il devient urgent et vital d'oeuvrer à plus de transparence dans ce domaine stratégique des LLM dits multilingues. La première pièce à l'édifice est donc la mise en place d'une évaluation systématique et systémique des modèles actuels et futurs.

Votre modèle IA français est-il prêt à se démarquer ? Soumettez le dans notre espace, et voyez comment vous vous comparez par rapport aux autres modèles.

❓ Comment ça marche :
Soumettez votre LLM français pour évaluation, et nous le testerons sur des benchmarks de référence spécifiquement adaptés pour la langue française — notre suite de benchmarks comprend :

- BBH-fr : Raisonnement complexe
- IFEval-fr : Suivi d'instructions
- GPQA-fr : Connaissances avancées
- MUSR-fr : Raisonnement narratif
- MATH_LVL5-fr : Capacités mathématiques
- MMMLU-fr : Compréhension multitâche

Le processus est encore manuel, mais nous travaillons sur son automatisation, avec le soutien de la communauté Hugging Face.

@clem , on se prépare pour une mise à niveau de l’espace ? 😏👀

Ce n'est pas qu'une question de chiffres—il s'agit de créer une IA qui reflète vraiment notre langue, notre culture et nos valeurs. OpenLLMFrenchLeaderboard est notre contribution personnelle pour façonner l'avenir des LLM en France.

1 reply

reacted to fdaudens's post with ❤️ 6 months ago

Post

1878

Look at that 👀

Actual benchmarks have become too easy for recent models, much like grading high school students on middle school problems makes little sense. So the team worked on a new version of the Open LLM Leaderboard with new benchmarks.

Stellar work from @clefourrier @SaylorTwift and the team!

👉 Read the blog post: open-llm-leaderboard/blog
👉 Explore the leaderboard: open-llm-leaderboard/open_llm_leaderboard

1 reply

reacted to alvdansen's post with 👍 6 months ago

Post

6845

I had a backlog of LoRA model weights for SDXL that I decided to prioritize this weekend and publish. I know many are using SD3 right now, however if you have the time to try them, I hope you enjoy them.

I intend to start writing more fully on the thought process behind my approach to curating and training style and subject finetuning, beginning this next week.

Thank you for reading this post! You can find the models on my page and I'll drop a few previews here.

4 replies

reacted to fffiloni's post with ❤️ 7 months ago

Post

19501

🇫🇷
Quel impact de l’IA sur les filières du cinéma, de l’audiovisuel et du jeu vidéo?
Etude prospective à destination des professionnels
— CNC & BearingPoint | 09/04/2024

Si l’Intelligence Artificielle (IA) est utilisée de longue date dans les secteurs du cinéma, de l’audiovisuel et du jeu vidéo, les nouvelles applications de l’IA générative bousculent notre vision de ce dont est capable une machine et possèdent un potentiel de transformation inédit. Elles impressionnent par la qualité de leurs productions et suscitent par conséquent de nombreux débats, entre attentes et appréhensions.

Le CNC a donc décider de lancer un nouvel Observatoire de l’IA Afin de mieux comprendre les usages de l’IA et ses impacts réels sur la filière de l’image. Dans le cadre de cet Observatoire, le CNC a souhaité dresser un premier état des lieux à travers la cartographie des usages actuels ou potentiels de l’IA à chaque étape du processus de création et de diffusion d’une œuvre, en identifiant les opportunités et risques associés, notamment en termes de métiers et d’emploi. Cette étude CNC / Bearing Point en a présenté les principaux enseignements le 6 mars, lors de la journée CNC « Créer, produire, diffuser à l’heure de l’intelligence artificielle ».

Le CNC publie la version augmentée de la cartographie des usages de l’IA dans les filières du cinéma, de l’audiovisuel et du jeu vidéo.

Lien vers la cartographie complète: https://www.cnc.fr/documents/36995/2097582/Cartographie+des+usages+IA_rapport+complet.pdf/96532829-747e-b85e-c74b-af313072cab7?t=1712309387891

4 replies

reacted to alielfilali01's post with 🔥 7 months ago

Post

1466

Just passed the 25 models milestone on the OALL/Open-Arabic-LLM-Leaderboard 🥳

And now meta-llama/Meta-Llama-3-70B-Instruct is the new hero of the leaderboard beating CohereForAI/c4ai-command-r-v01 by 5.43 points 🔥

Almost another 80 models are still PENDING ! So this might change very fast in the upcoming days

reacted to giux78's post with ❤️ 7 months ago

Post

1477

@FinancialSupport and I just released a new version of the Italian LLMs leaderboard https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard
using the super useful https://huggingface.co/demo-leaderboard template from @clefourrier .
We’ve evaluated over 50 models (base, merged, fine-tuned, etc.) from:
- Major companies like Meta, Mistral, Google ...
- University groups such as https://huggingface.co/sapienzanlp or https://huggingface.co/swap-uniba
- Italian Companies like https://huggingface.co/MoxoffSpA , https://huggingface.co/FairMind or https://huggingface.co/raicrits
- Various communities and individuals
All models were tested on #Italian benchmarks #mmlu #arc-c #hellaswag, which we contributed to the opensource lm-evaluation-harness library from https://huggingface.co/EleutherAI.
Plus, you can now submit your model for automatic evaluation, thanks to to https://huggingface.co/seeweb sponsored computation.
Curious about the top Italian models? Check out the leaderboard and submit your model!

https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard

reacted to singhsidhukuldeep's post with ❤️ 7 months ago

Post

1325

🎉 A new LLM is launched! 🚀
After checking if it's open-source or not, 🤔
you rush to see the benchmarks... 🏃‍♂️💨

Which benchmark does everyone check first? 🔍

MMLU (Massive Multitask Language Understanding)? 📚

Benchmarks like MMLU reaching saturation... most of the time the performance does not translate to real-world use cases! 🌐❗

Meet MMLU-Pro, released by TIGER-Lab on @huggingface ! 🐯🌍

🧪 12,217 questions across biology, business, chemistry, computer science, economics, engineering, health, history, law, mathematics, philosophy, physics, and psychology carefully validated by humans 🧑‍🔬

🔟 Goes to 10 options per question instead of 4, this increase in options will make the evaluation more realistic and reduce random guessing 🎯

📊 56% of questions come from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench 📈

🤖 LLMs with weak chain-of-thought reasoning tend to perform lower, indicating it is more challenging and representative of real-world expectations 🧠💡

Any guess who tops it and who bombs it? 🤔📉📈

GPT-4o drops by 17% (from 0.887 to 0.7149) 📉
Llama-3-70B drops by 27% (from 0.820 to 0.5541) 📉

🔗 TIGER-Lab/MMLU-Pro

2 replies

reacted to georgewritescode's post with ❤️🔥 8 months ago

Post

2322

Excited to bring our benchmarking leaderboard of >100 LLM API endpoints to HF!

Speed and price are often just as important as quality when building applications with LLMs. We bring together all the data you need to consider all three when you need to pick a model and API provider.

Coverage:
‣ Quality (Index of evals, MMLU, Chatbot Arena, HumanEval, MT-Bench)
‣ Throughput (tokens/s: median, P5, P25, P75, P95)
‣ Latency (TTFT: median, P5, P25, P75, P95)
‣ Context window
‣ OpenAI library compatibility

Link to Space: ArtificialAnalysis/LLM-Performance-Leaderboard

Blog post: https://huggingface.co/blog/leaderboard-artificial-analysis

reacted to osanseviero's post with 🔥 8 months ago

Post

9785

Diaries of Open Source. Part 15 🤗

🕵️‍♀️Idefics 2 is out, a multimodal open-source model with very nice capabilities
Models, demo, and datasets: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blog: https://hf.co/blog/idefics2

💾Snowflake released snowflake-arctic-embed, a family of powerful small embedding models
Model: Snowflake/snowflake-arctic-embed-m
Blog: https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/

✨Pile-T5, EleutherAI's T5 model trained on 2T tokens
Blog: https://blog.eleuther.ai/pile-t5/
Models: EleutherAI/pile-t5-65a76a0d0022dd270b385a66
GitHub: https://github.com/EleutherAI/improved-t5

🤖CodeQwen1.5-7B base and chat models. Models trained on 3T tokens strong benchmark results for code generation, editing and SQL
Blog post: https://qwenlm.github.io/blog/codeqwen1.5/
Demo: Qwen/CodeQwen1.5-7b-Chat-demo
Models: Qwen/CodeQwen1.5-7B and Qwen/CodeQwen1.5-7B-Chat

Misc
🦉 DocOwl1.5: Unified Stucture Learning for OCR-free Document Understanding mPLUG/DocOwl
👀Cerule - a tiny Vision LM model Tensoic/Cerule-v0.1
ChemLLM - a LLM for chemistry and molecule science ⚗️https://hf.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO
Distil Whisper Large
📝New pdf/OCR datasets with 19 samples pixparse/pdf-document-ocr-datasets-660701430b0346f97c4bc628
🔥Gretel AI high quality text-to-sql synthetic dataset gretelai/synthetic_text_to_sql

4 replies

reacted to isidentical's post with ❤️ 8 months ago

Post

2104

Happy to announce https://imgsys.org -- a sister project to Chatbot Arena by lmsys -- for comparing different text guided image generation models models. Try it natively on HuggingFace: https://huggingface.co/spaces/fal-ai/imgsys

1 reply

reacted to abidlabs's post with ❤️ 8 months ago

Post

3618

Open Models vs. Closed APIs for Software Engineers
-----------------------------------------------------------------------

If you're an ML researcher / scientist, you probably don't need much convincing to use open models instead of closed APIs -- open models give you reproducibility and let you deeply investigate the model's behavior.

But what if you are a software engineer building products on top of LLMs? I'd argue that open models are a much better option even if you are using them as APIs. For at least 3 reasons:

1) The most obvious reason is reliability of your product. Relying on a closed API means that your product has a single point-of-failure. On the other hand, there are at least 7 different API providers that offer Llama3 70B already. As well as libraries that abstract on top of these API providers so that you can make a single request that goes to different API providers depending on availability / latency.

2) Another benefit is eventual consistency going local. If your product takes off, it will be more economical and lower latency to have a dedicated inference endpoint running on your VPC than to call external APIs. If you've started with an open-source model, you can always deploy the same model locally. You don't need to modify prompts or change any surrounding logic to get consistent behavior. Minimize your technical debt from the beginning.

3) Finally, open models give you much more flexibility. Even if you keep using APIs, you might want to tradeoff latency vs. cost, or use APIs that support batches of inputs, etc. Because different API providers have different infrastructure, you can use the API provider that makes the most sense for your product -- or you can even use multiple API providers for different users (free vs. paid) or different parts of your product (priority features vs. nice-to-haves)

reacted to alozowski's post with 🔥❤️ 8 months ago

Post

2542

Do I need to make it a tradition to post here every Friday? Well, here we are again!

This week, I'm happy to share that we have two official Mistral models on the Leaderboard! 🔥 You can check them out: mistralai/Mixtral-8x22B-Instruct-v0.1 and mistralai/Mixtral-8x22B-v0.1

The most exciting thing here? mistralai/Mixtral-8x22B-Instruct-v0.1 model got a first place among pretrained models with an impressive average score of 79.15!🥇 Not far behind is the Mixtral-8x22B-v0.1, achieving second place with an average score of 74.47! Well done, Mistral AI! 👏

Check out my screenshot here or explore it yourself at the https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

The second news is that CohereForAI/c4ai-command-r-plus model in 4-bit quantization got a great average score of 70.08. Cool stuff, Cohere! 😎 (and I also have the screenshot for this, don't miss it)

The last news, which might seem small but is still significant, the Leaderboard frontpage now supports Python 3.12.1. This means we're on our way to speed up the Leaderboard's performance! 🚀

If you have any comments or suggestions, feel free to also tag me on X (Twitter), I'll try to help – [at]ailozovskaya

Have a nice weekend! ✨

2 replies

posted an update 8 months ago

Post

5441

In a basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm

reacted to sted97's post with 🔥 8 months ago

Post

2451

📣 I'm thrilled to announce "ALERT: A Comprehensive #Benchmark for Assessing #LLMs’ Safety through #RedTeaming" 🚨

📄 Paper: https://arxiv.org/pdf/2404.08676.pdf
🗃️ Repo: https://github.com/Babelscape/ALERT
🤗 ALERT benchmark: Babelscape/ALERT
🤗 ALERT DPO data: Babelscape/ALERT_DPO

As a key design principle for ALERT, we developed a fine-grained safety risk taxonomy (Fig. 2). This taxonomy serves as the foundation for the benchmark to provide detailed insights about a model’s weaknesses and vulnerabilities as well as inform targeted safety enhancements 🛡️

For collecting our prompts, we started from the popular
Anthropic's HH-RLHF data, and used automated strategies to filter/classify prompts. We then designed templates to create new prompts (providing sufficient support for each category, cf. Fig. 3) and implemented adversarial attacks.

In our experiments, we extensively evaluated several open- and closed-source LLMs (e.g. #ChatGPT, #Llama and #Mistral), highlighting their strengths and weaknesses (Table 1).

For more details, check out our preprint: https://arxiv.org/pdf/2404.08676.pdf 🤓

Huge thanks to @felfri , @PSaiml , Kristian Kersting, @navigli , @huu-ontocord and @BoLi-aisecure (and all the organizations involved: Babelscape, Sapienza NLP, TU Darmstadt, Hessian.AI, DFKI, Ontocord.AI, UChicago and UIUC)🫂

1 reply

posted an update 8 months ago

Post

4422

Contamination free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only on new problems out of the training data. This means... contamination free code evals! 🚀

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!

Clémentine Fourrier

AI & ML interests

Recent Activity

Articles

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Introduction to the Open Leaderboard for Japanese LLMs

Letting Large Models Debate: The First Multilingual LLM Debate Competition

Judge Arena: Benchmarking LLMs as Evaluators

Introducing the Open FinLLM Leaderboard

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5000B tokens tokens and 11 languages

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

Let's talk about LLM evaluation

Introducing the Open Arabic LLM Leaderboard

Introducing the Open Leaderboard for Hebrew LLMs!

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Improving Prompt Consistency with Structured Generations

Introducing the Open Chain of Thought Leaderboard

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Introducing the Chatbot Guardrails Arena

Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Introducing the Red-Teaming Resistance Leaderboard

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard

2023, year of open LLMs

Open LLM Leaderboard: DROP deep dive

Overview of natively supported quantization schemes in 🤗 Transformers

What's going on with the Open LLM Leaderboard?

Introduction to Graph Machine Learning

Organizations

clefourrier's activity