
Jared Sulzdorf PRO

jsulz

https://www.jsulz.com/

AI & ML interests

NLP + (Law|Medicine) & Ethics


Articles

Rearchitecting Hugging Face Uploads and Downloads

Nov 26

• 37

From Files to Chunks: Improving Hugging Face Storage Efficiency

Nov 20

• 45


jsulz's activity

reacted to julien-c's post with 🤗❤️🔥 21 days ago

Post

7799

After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team

28 replies

reacted to dvilasuero's post with 🔥❤️ 25 days ago

Post

2274

🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ 200+ contributors used Argilla to annotate MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional, cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU

reacted to fdaudens's post with 🧠 25 days ago

Post

1376

The viz of the day for the Year in review: Network graph showing likes similarity between models.

Instructive to see which models serve as the "nodes" of the Hub!

Check it out: huggingface/open-source-ai-year-in-review-2024

replied to their post 25 days ago

I thought big and complex repos would be fun to visualize and they can be! This image is from blanchon/RESISC45, a repo with 31,500 images from Google Earth, each bucketed into one of 45 categories with 700 images per category:

But more fun is when you find a repository that is structured (naming conventions and directories) in a way that lets you see the inequity in the bytes.

This is most apparent in multilingual NLP datasets, similar to the wikimedia/wikipedia dataset. If you zoom in on any of these (or run them yourself in the Space) you'll see a directory or file naming convention using the language abbreviation. The closer a directory or file section is to yellow, the more bytes are devoted to that language.

Here's facebook/multilingual_librispeech:

and mozilla-foundation/common_voice_17_0:

and google/xtreme:

and uonlp/CulturaX:

Each dataset shows some imbalance in the languages represented, and this pattern holds true for other types of datasets as well. However, such discrepancies can be harder to spot when folder or file naming conventions prioritize machine over human readability.
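A minimal sketch of how that imbalance could be quantified, assuming a repo listing is already available as (path, size in bytes) pairs and that the top-level directory is the language abbreviation. The function names and the example listing are hypothetical and are not taken from the Space:

```python
from collections import Counter


def bytes_by_top_level(files):
    """Sum file sizes under each top-level path component.

    `files` is an iterable of (path, size_in_bytes) pairs. The first path
    component is assumed to be the language code, which only holds for repos
    that follow that naming convention.
    """
    totals = Counter()
    for path, size in files:
        top = path.split("/", 1)[0]
        totals[top] += size
    return totals


def byte_shares(totals):
    """Convert absolute byte totals into fractions of the whole repo."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: size / grand_total for name, size in totals.items()}


# Hypothetical listing, not real data from any of the datasets above.
example_listing = [
    ("en/train.parquet", 900),
    ("de/train.parquet", 90),
    ("sw/train.parquet", 10),
]
language_totals = bytes_by_top_level(example_listing)
language_shares = byte_shares(language_totals)
```

Sorting `language_shares` by value gives a quick ranking of which languages dominate the bytes, which is the same signal the treemap colors convey visually.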

Another fun example is the nguha/legalbench dataset, designed to evaluate legal reasoning in LLMs. It provides a clear view of the types of reasoning being tested:

You might have to squint to see the labels, though. This is one where it might be best to head over to the Space https://huggingface.co/spaces/jsulz/repo-info and see it for yourself ;)

replied to their post 25 days ago

Datasets are among my favorite repos to visualize because of their mixture of files and folder structures. Here's huggingface/documentation-images, where alongside documentation images we store images for the Hugging Face blog:

I also enjoy the wikimedia/wikipedia dataset. It's fascinating to see the distribution of bytes across languages.

Some datasets are actually quite difficult to visualize because the number of points in the Plotly graph causes the browser to crash on render. It's quite possible you'll run into this if you use the Space. A simple check for file count could help, but for now I find myself running it a few times just to see if I can grab the image. allenai is home to many such datasets, but I eventually found allenai/paloma, an eval dataset that I could visualize.
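The file-count check mentioned above could look something like this. It is only a sketch: it assumes each file and each directory becomes one point in the treemap, and the 5,000-point threshold is a made-up number, not a measured browser limit or anything the Space currently does:

```python
MAX_TREEMAP_POINTS = 5_000  # hypothetical threshold, tune for your browser


def count_treemap_nodes(file_paths):
    """Count distinct files plus every directory that contains them."""
    nodes = set()
    for path in file_paths:
        parts = path.split("/")
        # Add the file itself and each ancestor directory as a node.
        for depth in range(1, len(parts) + 1):
            nodes.add("/".join(parts[:depth]))
    return len(nodes)


def safe_to_render(file_paths, max_points=MAX_TREEMAP_POINTS):
    """Return True when the treemap is small enough to attempt rendering."""
    return count_treemap_nodes(file_paths) <= max_points


# Hypothetical paths: one directory plus two files gives three nodes.
example_paths = ["a/b.txt", "a/c.txt"]
example_node_count = count_treemap_nodes(example_paths)
```

When the check fails, one option is to skip the interactive plot and write a static image instead.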

For some of these larger datasets, I might run things locally and write the image out to see if there are any interesting findings.

posted an update 26 days ago

Post

1304

Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)

And because I'm interested in how this will fit in our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co/blog/from-files-to-chunks), everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
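A rough sketch of the per-file-type totals and chunk counts, assuming the repo listing is already available as (path, size in bytes) pairs. It treats 64KB as 64 * 1024 bytes and simply divides total bytes by that size, so the chunk numbers are estimates rather than what a content-defined chunker would actually produce. The function name and example listing are hypothetical and are not the Space's code:

```python
import math
from collections import defaultdict
from pathlib import PurePosixPath

CHUNK_SIZE_BYTES = 64 * 1024  # assumption: "64KB" read as 64 * 1024 bytes


def summarize_by_extension(files):
    """Total bytes and estimated chunk count per file extension.

    `files` is an iterable of (path, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for path, size in files:
        # Only the final suffix is used, so "data.tar.gz" counts as ".gz".
        extension = PurePosixPath(path).suffix or "(no extension)"
        totals[extension] += size
    return {
        extension: {
            "bytes": total,
            "chunks": math.ceil(total / CHUNK_SIZE_BYTES),
        }
        for extension, total in totals.items()
    }


# Hypothetical listing, not the contents of any real repo.
example_files = [
    ("model-00001.safetensors", 1_000_000),
    ("model-00002.safetensors", 500_000),
    ("README.md", 2_000),
]
summary = summarize_by_extension(example_files)
```

For the example listing, the two safetensors files are summed into a single entry before the chunk estimate is taken.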

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and for fun laion/laion-audio-preview (which has nearly 10k .tar files 🤯)

2 replies

reacted to clem's post with 🔥🚀 28 days ago

Post

4280

Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):

- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race).
- There will be big breakthroughs in AI for biology and chemistry.
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.

How my predictions for 2024 turned out:

- A hyped AI company will go bankrupt or get acquired for a ridiculously low price
✅ (Inflection, AdeptAI,...)

- Open-source LLMs will reach the level of the best closed-source LLMs
✅ with QwQ and dozens of others

- Big breakthroughs in AI for video, time-series, biology and chemistry
✅ for video 🔴for time-series, biology and chemistry

- We will talk much more about the cost (monetary and environmental) of AI
✅Monetary 🔴Environmental (😢)

- A popular media will be mostly AI-generated
✅ with NotebookLM by Google

- 10 million AI builders on Hugging Face leading to no increase in unemployment
🔜currently 7M AI builders on Hugging Face

4 replies

reacted to cfahlgren1's post with 👍🔥🚀 28 days ago

Post

3008

We just dropped an LLM inside the SQL Console 🤯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025, you shouldn't be hand-writing SQL! This is a big step toward making it so anyone can do in-depth analysis on a dataset. Let us know what you think 🤗

reacted to fdaudens's post with 🚀❤️ 29 days ago

Post

1741

Keeping up with open-source AI in 2024 = overwhelming.

Here's help: We're launching our Year in Review on what actually matters, starting today!

Fresh content dropping daily until year end. Come along for the ride - first piece out now with @clem 's predictions for 2025.

Think of it as your end-of-year AI chocolate calendar.

Kudos to @BrigitteTousi @clefourrier @Wauplin @thomwolf for making it happen. We teamed up with aiworld.eu for awesome visualizations to make this digestible—it's a charm to work with their team.

Check it out: huggingface/open-source-ai-year-in-review-2024

reacted to prithivMLmods's post with 🤗❤️🔥 about 1 month ago

Post

3280

HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

🥠The one thing that needs to be remembered is the 'username'.

🥠And yeah, thank you, @maxiw , for creating the awesome dataset and sharing them here! 🙌

🥠[ Dataset ] : maxiw/hf-posts

.
.
.
@prithivMLmods

replied to their post about 1 month ago

Great question, we've talked about torrents before, actually!

How would you include torrents in your workflows today?

There's nothing stopping us from doing it, but the user/developer experience doesn't quite align with what we're trying to support right now. There are benefits to leveraging CDNs as we do today, and this integrates relatively seamlessly with existing clients (e.g., huggingface_hub) that are used across the Hub.

Maybe if there's enough interest in the future!