Yacine Jernite

yjernite

https://yjernite.github.io/

AI & ML interests

Technical, community, and regulatory tools of AI governance @HuggingFace

Recent Activity

liked a model 14 days ago

PatronusAI/glider

liked a Space 14 days ago

PatronusAI/GLIDER

liked a model 15 days ago

answerdotai/ModernBERT-large

Articles

🇪🇺✍️ EU AI Act: Systemic Risks in the First CoP Draft Comments ✍️🇪🇺

22 days ago

• 12

Open Source Developers Guide to the EU AI Act

Dec 2, 2024

• 35

EU Training Data Transparency: A Proposal for a Sufficiently Detailed Summary 📑📚🖼️🇪🇺

Jul 3, 2024

• 8

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

Jun 24, 2024

• 33

Policy Questions Blog 1: AI Data Transparency Remarks for NAIAC Panel 📚🔍⚖️

Mar 27, 2024

• 2

AI Watermarking 101: Tools and Techniques

Feb 26, 2024

• 15

📚 Training Data Transparency in AI: Tools, Trends, and Policy Recommendations 🗳️

Dec 5, 2023

• 1

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 28

AI Policy @🤗: Open ML Considerations in the EU AI Act

Jul 24, 2023

• 2

AI Policy @🤗: Response to the U.S. NTIA's Request for Comment on AI Accountability

Jun 20, 2023

Hugging Face Selected for the French Data Protection Agency Enhanced Support Program

May 15, 2023

Introducing the Data Measurements Tool: an Interactive Tool for Looking at Datasets

Nov 29, 2021

Organizations

yjernite's activity

liked a model 14 days ago

PatronusAI/glider

Text Generation • Updated about 12 hours ago • 1.18k • 29

liked a Space 14 days ago

Running

🦅

GLIDER

GLIDER: Grading LLM Interactions and Decisions using Explain

liked a model 15 days ago

answerdotai/ModernBERT-large

Fill-Mask • Updated 8 days ago • 21.6k • 281

upvoted a collection 15 days ago

ModernBERT

Collection

Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 15 days ago • 111

liked a Space 15 days ago

Running

154

🏃

huggingface/policy-docs

Updated 15 days ago • 883 • 6

liked a Space 16 days ago

Running

419

📈

Scaling test-time compute

reacted to merve's post with 👀 17 days ago

Post

3149

Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B model outperforms most 30B models 🧶

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with A2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work ⏯️

Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions applied to small models also scale properly when the model and dataset are scaled 📈 Scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal
> They also compare image encoders, trying a variety of models from shape-optimized SigLIP to DINOv2;
they find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some frozen parts gives the best yield

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models 🔥

6 replies

reacted to fdaudens's post with 👀 17 days ago

Post

1297

Did a fun experiment: What are the main themes emerging from the 100+ Nieman Journalism Lab predictions for 2025?

I used natural language processing to cluster and map them — really helps spot patterns that weren't obvious when reading predictions one by one. So what will shape journalism next year? A lot of AI and US politics (surprise!), but there's also this horizontal axis that spans from industry strategies to deep reflections on how to talk to the public.

Click any dot to explore the original prediction. What themes surprise/interest you the most?

👉 fdaudens/nieman_lab_2025_predictions_visualization

P.s.: I discovered that Nieman Lab's content is under Creative Commons license!

liked a Space 17 days ago

Running

🌖

Edge LLM Leaderboard

upvoted an article 17 days ago

Article

Finding Moroccan Arabic (Darija) in Fineweb 2

26 days ago

• 20

liked a Space 21 days ago

Sleeping

🏢

Social Impact Dashboard

posted an update 21 days ago

Post

2066

🇪🇺 Policy Thoughts in the EU AI Act Implementation 🇪🇺

There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks part, on the other hand, is concerning for both smaller developers and for external stakeholders.

I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.

Full blog here, based on our submitted response with @frimelle and @brunatrevelin:

https://huggingface.co/blog/yjernite/eu-draft-cop-risks#on-the-proposed-taxonomy-of-systemic-risks

2 replies

upvoted an article 22 days ago

Article

🇪🇺✍️ EU AI Act: Systemic Risks in the First CoP Draft Comments ✍️🇪🇺

22 days ago

• 12

reacted to dvilasuero's post with ❤️🔥 28 days ago

Post

2278

🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ +200 contributors used Argilla to annotate MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional, cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU

reacted to fdaudens's post with ❤️ 30 days ago

Post

1063

📈👀 Just dropped: visualization mapping Hugging Face's most liked & downloaded models from 2022 to now. Small models are clearly on the rise - fascinating shift in both likes and download patterns.

Check it out: huggingface/open-source-ai-year-in-review-2024

reacted to AdinaY's post with ❤️ 30 days ago

Post

1476

2023 & 2024 Top Downloaded (all time) Open Models on the hub are both from the Chinese community 👀

2023 👉 Bge base by BAAI
BAAI/bge-base-en-v1.5
2024 👉 Qwen 2.5 by Alibaba Qwen
Qwen/Qwen2.5-1.5B-Instruct

Can’t wait to see what incredible models the Chinese community will bring in 2025🚀

✨ Follow https://huggingface.co/zh-ai-community to get the latest updates from the Chinese community
✨ Explore the 2024 Year in Review huggingface/open-source-ai-year-in-review-2024

liked 2 Spaces about 1 month ago

Running on Zero

👁

Olmo Test

Running

493

😻

Open Source Ai Year In Review 2024

What happened in open-source AI this year, and what’s next?

Public Policy at Hugging Face

Ethics and Society Newsletter #3: Ethical Openness at Hugging Face

Ethics and Society Newsletter #2: Let's talk about bias!

Putting ethical principles at the core of research lifecycle

Jupyter Agent
