HuggingFaceEval

non-profit

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

hynky authored a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

clefourrier authored a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

thomwolf authored a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

View all activity

HuggingFaceEvalInternal's activity

hynky

authored a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 5 days ago • 140

clefourrier

authored a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 5 days ago • 140

thomwolf

authored a paper 3 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 5 days ago • 140

albertvillanova

posted an update 5 days ago

Post

2731

🚀 Introducing @huggingface Open Deep-Research💥

In just 24 hours, we built an open-source agent that:
✅ Autonomously browse the web
✅ Search, scroll & extract info
✅ Download & manipulate files
✅ Run calculations on data

55% on GAIA validation set! Help us improve it!💡
https://huggingface.co/blog/open-deep-research

3 replies

hynky

authored a paper 24 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 26 days ago • 53

thomwolf

authored a paper 24 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 26 days ago • 53

albertvillanova

posted an update about 1 month ago

Post

2044

Discover all the improvements in the new version of Lighteval: https://huggingface.co/docs/lighteval/

thomwolf

posted an update 2 months ago

Post

5347

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

clefourrier

authored a paper 2 months ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 17

thomwolf

posted an update 2 months ago

Post

1549

Exponentially growing number of open-source AI models over the course of the past 30 months – from a few thousands to over 1 million and more

Interactive data viz: huggingface/open-source-ai-year-in-review-2024

thomwolf

posted an update 2 months ago

Post

1496

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: https://aiworld.eu/embed/model/model/treemap
Discussion: huggingface/open-source-ai-year-in-review-2024

thomwolf

posted an update 3 months ago

Post

1719

Interesting long read from @evanmiller-anthropic on having a better founded statistical approach to Language Model Evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

SaylorTwift

posted an update 3 months ago

Post

493

How do I test an LLM for my unique needs?
If you work in finance, law, or medicine, generic benchmarks are not enough.
This blog post uses Argilla, Distilllabel and 🌤️Lighteval to generate evaluation dataset and evaluate models.

https://github.com/argilla-io/argilla-cookbook/blob/main/domain-eval/README.md

albertvillanova

posted an update 3 months ago

Post

1771

🚨 How green is your model? 🌱 Introducing a new feature in the Comparator tool: Environmental Impact for responsible #LLM research!
👉 open-llm-leaderboard/comparator
Now, you can not only compare models by performance, but also by their environmental footprint!

🌍 The Comparator calculates CO₂ emissions during evaluation and shows key model characteristics: evaluation score, number of parameters, architecture, precision, type... 🛠️
Make informed decisions about your model's impact on the planet and join the movement towards greener AI!

thomwolf

posted an update 3 months ago

Post

1457

Very exciting new mistralai/Pixtral-Large-Instruct-2411 model from Mistral-AI

Impressive performances, huge congrats @patrickvonplaten @sgvaze @pandora-s @devendrachaplot @sophiamyang and team!

Very nice to have SOTA Multilingual OCR and Chart understanding in an open-weights model

albertvillanova

posted an update 3 months ago

Post

1579

🚀 New feature of the Comparator of the 🤗 Open LLM Leaderboard: now compare models with their base versions & derivatives (finetunes, adapters, etc.). Perfect for tracking how adjustments affect performance & seeing innovations in action. Dive deeper into the leaderboard!

🛠️ Here's how to use it:
1. Select your model from the leaderboard.
2. Load its model tree.
3. Choose any base & derived models (adapters, finetunes, merges, quantizations) for comparison.
4. Press Load.
See side-by-side performance metrics instantly!

Ready to dive in? 🏆 Try the 🤗 Open LLM Leaderboard Comparator now! See how models stack up against their base versions and derivatives to understand fine-tuning and other adjustments. Easier model analysis for better insights! Check it out here: open-llm-leaderboard/comparator 🌐

albertvillanova

posted an update 3 months ago

Post

3156

🚀 Exciting update! You can now compare multiple models side-by-side with the Hugging Face Open LLM Comparator! 📊

open-llm-leaderboard/comparator

Dive into multi-model evaluations, pinpoint the best model for your needs, and explore insights across top open LLMs all in one place. Ready to level up your model comparison game?

thomwolf

posted an update 4 months ago

Post

4190

Parents in the 1990: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨

2 replies

albertvillanova

posted an update 4 months ago

Post

1241

🚨 Instruct-tuning impacts models differently across families! Qwen2.5-72B-Instruct excels on IFEval but struggles with MATH-Hard, while Llama-3.1-70B-Instruct avoids MATH performance loss! Why? Can they follow the format in examples? 📊 Compare models: open-llm-leaderboard/comparator

albertvillanova

posted an update 4 months ago

Post

1941

Finding the Best SmolLM for Your Project

Need an LLM assistant but unsure which hashtag#smolLM to run locally? With so many models available, how can you decide which one suits your needs best? 🤔

If the model you’re interested in is evaluated on the Hugging Face Open LLM Leaderboard, there’s an easy way to compare them: use the model Comparator tool: open-llm-leaderboard/comparator
Let’s walk through an example👇

Let’s compare two solid options:
- Qwen2.5-1.5B-Instruct from Alibaba Cloud Qwen (1.5B params)
- gemma-2-2b-it from Google (2.5B params)

For an assistant, you want a model that’s great at instruction following. So, how do these two models stack up on the IFEval task?

What about other evaluations?
Both models are close in performance on many other tasks, showing minimal differences. Surprisingly, the 1.5B Qwen model performs just as well as the 2.5B Gemma in many areas, even though it's smaller in size! 📊

This is a great example of how parameter size isn’t everything. With efficient design and training, a smaller model like Qwen2.5-1.5B can match or even surpass larger models in certain tasks.

Looking for other comparisons? Drop your model suggestions below! 👇

AI & ML interests

Recent Activity

Team members 7

HuggingFaceEvalInternal's activity