
Banerjee

port8080

AI & ML interests

datasets

Recent Activity


Articles

Organizations

Hugging Face, Xet Team

port8080's activity

reacted to jsulz's post with 👍🔥 16 days ago
Post
1289
Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)
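The last item, cumulative size per file type, can be sketched roughly as below. This is only an illustration of the idea, not the Space's actual code: the function name `size_by_extension` and the sample file list are hypothetical, and the sizes are made-up numbers.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def size_by_extension(files):
    """Sum file sizes per extension.

    `files` is an iterable of (path, size_in_bytes) pairs. Files without
    an extension are grouped under "(none)".
    """
    totals = defaultdict(int)
    for path, size in files:
        ext = PurePosixPath(path).suffix or "(none)"
        totals[ext] += size
    return dict(totals)


# Hypothetical repo listing, for illustration only.
sample_files = [
    ("model-00001.safetensors", 10),
    ("model-00002.safetensors", 5),
    ("README.md", 2),
    ("LICENSE", 1),
]
totals = size_by_extension(sample_files)
```

In a real tool the (path, size) pairs would come from the repo's file listing; here they are hard-coded so the sketch stays self-contained.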

And because I'm interested in how this will fit into our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co/blog/from-files-to-chunks), everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
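As a rough illustration of the "1 chunk = 64KB" figure, a chunk count can be estimated from a size in bytes as follows. This is a sketch under two assumptions not stated in the post: that KB means 1024 bytes, and that a partial chunk is rounded up. Content-defined chunks vary in size, so a count derived this way is an estimate.

```python
CHUNK_SIZE = 64 * 1024  # 64KB per chunk, as stated in the post (assuming 1KB = 1024 bytes)


def estimated_chunks(size_bytes: int) -> int:
    """Estimate the number of 64KB chunks needed for `size_bytes`.

    Rounds up, so any non-empty file counts as at least one chunk
    (an assumption made for this sketch).
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return (size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE


one_gib_chunks = estimated_chunks(1024 ** 3)
```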

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and for fun laion/laion-audio-preview (which has nearly 10k .tar files 🤯)

reacted to jsulz's post with 🔥 about 1 month ago
Post
2910
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks.
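To make the three points above concrete, here is a minimal sketch of content-defined chunking with a simple rolling hash. It is not the Hub's implementation: the hash, the tiny chunk-size parameters (`MASK`, `MIN_SIZE`, `MAX_SIZE`), and the function names are all assumptions chosen so the example runs quickly on a few kilobytes of data.

```python
import hashlib
import random
from typing import List

# Illustrative parameters only (hypothetical, far smaller than 64KB chunks).
MASK = (1 << 6) - 1   # cut when the low 6 hash bits are zero (~64-byte spacing)
MIN_SIZE = 16         # never cut a chunk shorter than this
MAX_SIZE = 256        # always cut once a chunk reaches this length

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> List[bytes]:
    """Split data at boundaries chosen from the content itself."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes shift out, so the low bits
        # depend only on the most recent few bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunks_to_upload(old: bytes, new: bytes) -> List[bytes]:
    """Return the chunks of `new` whose hashes are not already stored for `old`."""
    known = {hashlib.sha256(c).hexdigest() for c in chunk(old)}
    return [c for c in chunk(new) if hashlib.sha256(c).hexdigest() not in known]


# Two versions of a file: v2 has a small insertion in the middle of v1.
_data_rng = random.Random(1)
v1 = bytes(_data_rng.getrandbits(8) for _ in range(4000))
v2 = v1[:2000] + b"EDIT" + v1[2000:]
upload = chunks_to_upload(v1, v2)
upload_bytes = sum(len(c) for c in upload)
```

Because boundaries depend on nearby content rather than on fixed offsets, the insertion shifts the later bytes without changing most chunk hashes, so only the chunks around the edit need to be sent. With fixed-size blocks, every block after the insertion would differ.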

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning on bringing our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
reacted to erinys's post with 🚀 2 months ago
New activity in xet-team/lfs-analysis 2 months ago

LFS Analysis Roadmap

#3 opened 2 months ago by jsulz
upvoted an article 3 months ago
Article

Improving Parquet Dedupe on Hugging Face Hub

• 31
upvoted an article 5 months ago
Article

XetHub is joining Hugging Face!

• 81