Yacine Jernite

yjernite

AI & ML interests

Technical, community, and regulatory tools of AI governance @HuggingFace

Recent Activity

Articles

Organizations

yjernite's activity

Reacted to cfahlgren1's post with ❤️ 4 days ago
view post
Post
2862
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
  • 1 reply
·
Reacted to fdaudens's post with 🔥 12 days ago
view post
Post
1826
Fascinating point from @thomwolf at Web Summit: AI misuse (deepfakes, fake news) is actually easier to make with closed models, not with open-source ones.

This challenges the common narrative that open-source AI is inherently more dangerous. The reality is more nuanced - while we may think open source is technically easier to misuse, closed models' accessibility and product-focused design appear to be driving more actual harm.

Important context for current AI safety discussions and regulation debates.

Do you agree? 👇
  • 1 reply
·
upvoted an article 12 days ago
view article
Article

Releasing the largest multilingual open pretraining dataset

94
upvoted an article 25 days ago
liked a Space 27 days ago
updated a Space about 1 month ago