Dataset Tools

community

AI & ML interests

Tools for creating and exploring datasets

Recent Activity

Dataset-Tools's activity

davanstrien posted an update about 5 hours ago
Introducing FineWeb-C 🌍🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
fdaudens posted an update 1 day ago
From instruction-following to creative storytelling, dive into 2024's most impactful AI datasets! These gems are shaping everything from scientific research to video understanding.

Check it out: huggingface/open-source-ai-year-in-review-2024
prithivMLmods posted an update 1 day ago
Qwen2VL Models: Vision and Language Processing

FT: [ LaTeX OCR, Math Parsing, Text Analogy OCRTest ]

❄️Demo : prithivMLmods/Qwen2-VL-2B . The demo includes the Qwen2VL 2B Base Model.

🎯The space handles documenting content from the input image along with standardized plain text. It includes adjustment tools with over 30 font styles, file formatting support for PDF and DOCX, textual alignments, font size adjustments, and line spacing modifications.

📄PDFs are rendered using the ReportLab library.

🧡Models :
+ prithivMLmods/Qwen2-VL-OCR-2B-Instruct
+ prithivMLmods/Qwen2-VL-Ocrtest-2B-Instruct
+ prithivMLmods/Qwen2-VL-Math-Prase-2B-Instruct

🚀Sample Document :
+ https://drive.google.com/file/d/1Hfqqzq4Xc-3eTjbz-jcQY84V5E1YM71E/view?usp=sharing

📦Collection :
+ prithivMLmods/vision-language-models-67639f790e806e1f9799979f

.
.
.
@prithivMLmods 🤗
davidberenstein1957 posted an update 1 day ago
prithivMLmods posted an update 2 days ago
🎄 Here Before - Xmas🎅✨

🧑🏻‍🎄Models
+ [ Xmas 2D Illustration ] : strangerzonehf/Flux-Xmas-Illustration-LoRA
+ [ Xmas 3D Art ] : strangerzonehf/Flux-Xmas-3D-LoRA
+ [ Xmas Chocolate ] : strangerzonehf/Flux-Xmas-Chocolate-LoRA
+ [ Xmas Isometric Kit ] : strangerzonehf/Flux-Xmas-Isometric-Kit-LoRA
+ [ Xmas Realpix ] : strangerzonehf/Flux-Xmas-Realpix-LoRA
+ [ Xmas Anime ] : strangerzonehf/Flux-Anime-Xmas-LoRA

❄️Collections
+ [ Xmas Art ] : strangerzonehf/christmas-pack-6758b199487adafaddb68f82
+ [ Stranger Zone Collection ] : prithivMLmods/stranger-zone-collections-org-6737118adcf2cb40d66d0c7e

🥶Page
+ [ Stranger Zone ] : https://huggingface.co/strangerzonehf


.
.
.
@prithivMLmods 🤗
fdaudens posted an update 2 days ago
🀝 Want to share your AI models while protecting your work? Licenses are key!

Fascinating to see that nearly 60% of models on the Hub use Apache & MIT licenses.

Explore the viz here: huggingface/open-source-ai-year-in-review-2024
fdaudens posted an update 3 days ago
Did a fun experiment: What are the main themes emerging from the 100+ Nieman Journalism Lab predictions for 2025?

I used natural language processing to cluster and map them, which really helps spot patterns that weren't obvious when reading predictions one by one. So what will shape journalism next year? A lot of AI and US politics (surprise!), but there's also this horizontal axis that spans from industry strategies to deep reflections on how to talk to the public.

Click any dot to explore the original prediction. What themes surprise/interest you the most?

👉 fdaudens/nieman_lab_2025_predictions_visualization

P.S.: I discovered that Nieman Lab's content is under a Creative Commons license!
nataliaElv posted an update 3 days ago
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU
davidberenstein1957 posted an update 4 days ago
Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: A simple step-by-step process, making dataset creation a non-technical breeze, allowing anyone to create datasets and models in minutes and without any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator
fdaudens posted an update 6 days ago
prithivMLmods posted an update 7 days ago
alielfilali01 posted an update 7 days ago
Unpopular opinion: Open Source takes courage to do!

Not everyone is brave enough to release what they have done (the way they've done it) into the wild to be judged!
It really requires a high level of "knowing wth you are doing"! It's kind of a superpower!

Cheers to the heroes here who see this!
lhoestq posted an update 8 days ago
Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
fdaudens posted an update 9 days ago
nataliaElv posted an update 9 days ago
How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my mates 😂
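
For anyone curious what such a comparison could look like, here is a minimal sketch in Python. It is not the code behind the Gradio space: the label scale, the `(annotator, label)` record shape, and the function names are all assumptions made up for illustration. It maps each label to a number and compares one annotator's average with everyone else's.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical label scale, ordered from least to most educational.
# The real annotation labels may differ; adjust this mapping as needed.
LABEL_SCORES = {"None": 0, "Minimal": 1, "Basic": 2, "Good": 3, "Excellent": 4}


def mean_score_by_annotator(annotations):
    """Map each annotator to the mean numeric score of their labels."""
    scores = defaultdict(list)
    for annotator, label in annotations:
        scores[annotator].append(LABEL_SCORES[label])
    return {annotator: mean(values) for annotator, values in scores.items()}


def harshness_gap(annotations, me):
    """Return my mean score minus the mean score of all other annotators' labels.

    A negative value means I rate texts lower than my teammates do.
    """
    mine = [LABEL_SCORES[label] for annotator, label in annotations if annotator == me]
    others = [LABEL_SCORES[label] for annotator, label in annotations if annotator != me]
    if not mine or not others:
        raise ValueError("need at least one annotation from me and one from others")
    return mean(mine) - mean(others)


# Made-up example records: (annotator, label) pairs.
annotations = [
    ("me", "None"), ("me", "Minimal"), ("me", "Basic"),
    ("teammate_a", "Basic"), ("teammate_a", "Good"),
    ("teammate_b", "Minimal"), ("teammate_b", "Excellent"),
]

print(f"gap: {harshness_gap(annotations, 'me'):+.2f}")
```

A gap below zero suggests a harsher-than-average annotator. A fairer comparison would restrict both sides to texts that were labelled by more than one person, which this sketch does not do.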


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket.Chat: HuggingFaceFW/discussion
davidberenstein1957 posted an update 11 days ago
Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. This dataset contains 10K text-to-image preference pairs across common image generation categories, while using different model families and varying prompt complexities.

https://huggingface.co/blog/image-preferences
alielfilali01 posted an update 12 days ago
Apparently I forgot to put this here!

Well, this is a bit late, but consider giving our recent blog a read if you are interested in evaluation.

You don't have to be into Arabic NLP to read it; the main contribution we are introducing is a new evaluation measure for NLG. We made the first application of this measure on Arabic for now, and we will be working with colleagues from the community to expand it to other languages.

Blog:
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
https://huggingface.co/blog/leaderboard-3c3h-aragen

Space:
inceptionai/AraGen-Leaderboard

Give it a read and let me know your thoughts 🤗
fdaudens posted an update 14 days ago
The viz of the day for the Year in review: Network graph showing likes similarity between models.

Instructive to see which models serve as the "nodes" of the Hub!

Check it out: huggingface/open-source-ai-year-in-review-2024