
Banerjee

port8080

AI & ML interests

datasets

Articles

Rearchitecting Hugging Face Uploads and Downloads

27 days ago

• 37

Organizations

port8080's activity

reacted to jsulz's post with 👍🔥 16 days ago

Post

1289

Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)

And because I'm interested in how this will fit in our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co/blog/from-files-to-chunks), everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
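The two per-repo numbers described above (cumulative size per file type, and a chunk count at 1 chunk = 64KB) can be sketched roughly as follows. This is not the Space's actual code: the function names and the `(path, size)` file listing are hypothetical, and treating every chunk as exactly 64KB is a simplification taken from the post.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption from the post: 1 chunk = 64KB.
CHUNK_SIZE = 64 * 1024


def size_by_extension(files):
    """Sum file sizes (in bytes) per file extension.

    `files` is a hypothetical listing of (path, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for path, size in files:
        ext = PurePosixPath(path).suffix or "(none)"
        totals[ext] += size
    return dict(totals)


def chunk_count(size_bytes):
    """Number of 64KB chunks needed to hold size_bytes (rounded up)."""
    return (size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE


# Made-up example listing, for illustration only.
example_files = [
    ("model-00001.safetensors", 200_000),
    ("model-00002.safetensors", 100_000),
    ("README.md", 1_000),
    ("LICENSE", 500),
]
totals = size_by_extension(example_files)
chunks = {ext: chunk_count(size) for ext, size in totals.items()}
```

Here `totals` holds the cumulative bytes per extension and `chunks` the corresponding chunk counts, which is the kind of pairing the Space reports alongside each size.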

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and for fun laion/laion-audio-preview (which has nearly 10k .tar files 🤯)

2 replies

reacted to jsulz's post with 🔥 about 1 month ago

Post

2910

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks.
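As a rough illustration of the chunk-based idea in the list above, here is a minimal content-defined chunking sketch. It is not the Hub's implementation: the gear-style rolling hash, the tiny size parameters, and the names `chunk` and `new_chunks` are all assumptions chosen to keep the example small.

```python
import hashlib
import random

# Hypothetical parameters, far smaller than any real system would use.
MIN_SIZE = 16        # never cut a chunk shorter than this
MAX_SIZE = 256       # always cut once a chunk reaches this length
MASK = 0xFC000000    # top 6 bits zero -> cut, roughly every 64 bytes

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes shift out after 32 steps.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_chunks(old: bytes, new: bytes) -> list[bytes]:
    """Chunks of `new` whose hashes do not appear among the chunks of `old`."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    return [c for c in chunk(new) if hashlib.sha256(c).digest() not in known]
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, inserting a few bytes into the middle of a file leaves the chunks before the edit untouched and usually lets the chunks after it realign, so `new_chunks` returns only a small fraction of the file to upload.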

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025. Check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks

reacted to erinys's post with 🚀 2 months ago

Post

2151

🌍 Super cool visualization of global PUT requests to Hugging Face over 24 hours, coded by object size, thanks to @port8080!

We're putting this analysis to work to help us architect a more geo-distributed system for the HF storage backend.

Originally shared on LinkedIn: https://www.linkedin.com/posts/ajitbanerjee_one-of-the-joys-of-working-on-the-xethub-activity-7252688424732614656-tFGD

New activity in xet-team/lfs-analysis 2 months ago

LFS Analysis Roadmap

#3 opened 2 months ago by jsulz

upvoted an article 3 months ago

Article

Improving Parquet Dedupe on Hugging Face Hub

Oct 5

• 31

upvoted an article 5 months ago

Article

XetHub is joining Hugging Face!

Aug 8

• 81