dataset-viber (Dataset Viber)

davidberenstein1957

posted an update about 8 hours ago

Post

218

Fine-tune a SmolLM on domain-specific synthetic data from an LLM

Blog: https://huggingface.co/blog/davidberenstein1957/fine-tune-a-smollm-on-synthetic-data-of-llm

davidberenstein1957

posted an update 5 days ago

Post

1850

Fine-tuning ModernBERT for text classification using synthetic data generation

From prompt to model in 3 steps.
1 dataset description
20 minutes of generating
60 minutes of fine-tuning on my MacBook Pro

Tutorial: https://nbsanity.com/static/552eb50cbd91bedb4e5b73fddca2664a/fine-tune-modernbert-classifier.html

davidberenstein1957

posted an update 16 days ago

Post

1343

🐇 Tumble down the AI rabbit hole without any technical knowledge!

Explore AI models on the Hub with a simple and quick search

Demo: davidberenstein1957/transformers-pipeline-playground

davidberenstein1957

posted an update 19 days ago

Post

4165

Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process that makes dataset creation a non-technical breeze, so anyone can create datasets and models in minutes without any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator


davidberenstein1957

posted an update 26 days ago

Post

2065

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. This dataset contains 10K text-to-image preference pairs across common image generation categories, while using different model families and varying prompt complexities.

https://huggingface.co/blog/image-preferences

davidberenstein1957

posted an update about 1 month ago

Post

1185

This is amazing for cheap model fine-tunes without the hassle of actual deployment! TIL: LoRA fine-tunes for models on the Hub can be used directly for inference!

davidberenstein1957

posted an update about 1 month ago

Post

3435

The Data Is Better Together community is set to release the first Apache 2.0 licensed image preference dataset!

Great work and let's give this a final push :)

@aashish1904 congrats on your month of HF Pro. There is more to win during this sprint!

@aashish1904 @AnyaDesdein @davidberenstein1957 @Malalatiana @beta3 @fffiloni @munish0838 @Reza2kn @bbunzeck @Creazycreator @andrei-saceleanu @jafhaponiuk @rca-etl @kf120 @burtenshaw @mmhamdy @grib0ed0v @Doopus @AnyaDes @ttkap @Xceron @Lewox @davanstrien @Azazelle @adirik @Ashish08 @AntonVic @kenantang @sdiazlor @g-ronimo @dennis-rall @prithivMLmods @girtss3 @flozi00 @WaveCut @Taylor658 @Wildminder @Sara9999 @phaelishall @sararob @dvilasuero @pgabrys @plaguss @CDS899 @timajwilliams @rudzinskimaciej @pavel-ai @aggr8 @ignacioct @MouseAI @Leeps @MaksKul @NicolasDmln @Muinez @kusht55 @caiolang @Jakub-Brand24 @loamy @Demijan @eliab96 @Viewegger @JosephCatrambone @p1atdev @mrshu @o639 @Targezed @Aviv-anthonnyolime @thliang01 @Ahmed-Amine @glards @pranaykoppula @nataliaElv @MaPirlet @alvarobartt @gabrielmbmb @zlicastro @Jaydip @Chouettecheveche @lilcheaty @ruyrdiaz @robintema @fdaudens @ggcristian @a-r-r-o-w @pates @joheras @stopsatgreen @bezo97 @chachi902 @iamyann @liamcripwell @dmb23 @korbih @anonymous7743 @akbdx18 @OVAWARE @severo @akontra @lichorosario @lhoestq @SebastianBodza @Vishnou @ameerazam08 @appoose @Mukei @mearco @joaquincabezas @Fizzarolli @thomastraum @igortopolski @OxxoCodes @patrickfleith @asoria @bn22 @sitammeur @Krodolf @bergr7f @Sbxxn @wietsevenema @sugatoray @Iamladi @MikeTrizna @feveromo @mokady @Bolero @prath @Dowwie @kfahn @decodingchris @alili2050 @RahulRaman @yzimmermann @Ameeeee @ecyht2 @MattMC001 @hemanthkumarak @Thegorgibus @akos2 @LawRun @ramithuh @SuperMuel @sjans @peterizsak @mosama @Eyel @mtr3 @cfahlgren1 @legentil @clem @Citaman @Aurelien-Morgan @AntoineBourgois @TotoB12 @Stanmey @osanseviero @multimodalart @maxiw @ariG23498 @ngk89 @femboysLover @dvs @tacohiddink @blanchon @DavidJimenez


davidberenstein1957

posted an update about 1 month ago

Post

1575

🔥 Dataset Drop - Open Image Preferences

Black Forest Labs Flux Dev vs. Stability AI Stable Diffusion 3.5 Large

Together with the data-is-better-together community, we've worked on an Apache 2.0 licensed open image preference dataset based on the fal ai imgsys prompts dataset. Thanks to the awesome community, we have managed to get 5K preference pairs in less than 2 days. The annotation alignment among annotators is great too.

Aashish Kumar won a month of Hugging Face Pro by making the most contributions! Congrats from the entire team 🥇

The best thing?! We are not done yet! Let's keep the annotations coming for 5K more in the second part of the sprint (with more prizes to go around)!

Dataset: https://huggingface.co/datasets/data-is-better-together/image-preferences-results
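
As a rough illustration of what "annotation alignment among annotators" can mean for preference pairs, here is a minimal sketch that computes the share of votes matching the majority choice per pair. The record layout, the pair ids and the "a"/"b" labels are assumptions made up for this example; they are not the schema of the published dataset, and the sprint may have used a different agreement measure.

```python
from collections import Counter

# Hypothetical annotations: pair id -> choices made by different annotators.
# "a" / "b" name the two candidate images. This layout is an assumption for
# illustration only, not the schema of the published dataset.
annotations = {
    "pair-1": ["a", "a", "a"],
    "pair-2": ["a", "b", "a"],
    "pair-3": ["b", "b"],
    "pair-4": ["a"],  # a single vote: agreement cannot be measured
}


def majority_agreement(votes: list[str]) -> float:
    """Share of votes that match the most common choice for one pair."""
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)


def mean_agreement(data: dict[str, list[str]]) -> float:
    """Average majority agreement over pairs with at least two votes."""
    scores = [majority_agreement(v) for v in data.values() if len(v) >= 2]
    return sum(scores) / len(scores) if scores else 0.0


overall = mean_agreement(annotations)
print(f"mean majority agreement: {overall:.2f}")
```

Pairs with a single vote are skipped because they would trivially score 1.0 and inflate the average.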

davidberenstein1957

posted an update about 1 month ago

Post

1705

Let’s make a generation of amazing image-generation models

The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source so the community cannot train open models on them. Let’s change that!

The community can contribute image preferences for an open-source dataset that could be used for building AI models that convert text to image, like the FLUX or Stable Diffusion families. The dataset will be open source so everyone can use it to train models that we can all use.

Blog: https://huggingface.co/blog/burtenshaw/image-preferences

davidberenstein1957

posted an update about 1 month ago

Post

969

Watch and learn!

Let's observe Qwen2.5-coder:0.5b on OpenAI HumanEval.

pip install observers

And start collecting your data on the Hugging Face Hub.
Dataset: davidberenstein1957/openai_records
Library: https://github.com/cfahlgren1/observers

davidberenstein1957

posted an update about 1 month ago

Post

1074

🤗🔭 Introducing Observers: A Lightweight SDK for AI Observability 🔭🤗

Observers is an open-source Python SDK that provides comprehensive observability for AI applications. Our library makes it easy to:

- Track and record interactions with AI models
- Store observations in multiple backends
- Query and analyse your AI interactions with ease

https://huggingface.co/blog/davidberenstein1957/observers-a-lightweight-sdk-for-ai-observability

davidberenstein1957

posted an update about 2 months ago

Post

1970

For anyone who struggles with NER or information extraction with LLMs.

We showed an efficient workflow for token classification including zero-shot suggestions and model fine-tuning with Argilla, GLiNER, the NuMind NuExtract LLM and SpanMarker. @argilla

Video: https://youtu.be/JvLpaYgNd84?feature=shared
Notebooks and slides included to try it yourself 🙂

davidberenstein1957

posted an update 2 months ago

Post

2089

Import any dataset from the Hub and configure your labeling tasks without needing any code!

Really excited about extending the Hugging Face Hub integration with many more streamlined features and workflows, and we would love to hear your feedback and ideas, so don't feel shy and reach out 🫶🏽

https://huggingface.co/blog/argilla-ui-hub

davidberenstein1957

posted an update 2 months ago

Post

3093

Vector Search (most) datasets on the Hugging Face Hub 🔦

Powered by: Polars, DuckDB, Gradio and model2vec (lightning-fast embeddings by Stéphan Tulkens).

Should work fast enough for datasets up to 100K.

davidberenstein1957/vectorsearch-hub-datasets
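
To make the idea of vector search over dataset rows concrete, here is a minimal brute-force sketch in plain Python. It is not how the Space is implemented: the hashed bag-of-words `embed` function is a toy stand-in for a real embedding model such as model2vec, and the `DIM` value, function names and sample rows are all assumptions made up for this example.

```python
import hashlib
import math
from collections import Counter

DIM = 64  # size of the toy embedding space (arbitrary choice)


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query: str, rows: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Brute-force cosine search: score every row against the query, keep the top k."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(row))), row) for row in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


rows = [
    "fine-tune a small language model on synthetic data",
    "image preference pairs for text-to-image generation",
    "token classification with zero-shot suggestions",
]
results = search("synthetic data for a small model", rows, k=2)
for score, row in results:
    print(f"{score:.2f}  {row}")
```

Scoring every row on every query is linear in the number of rows, which is one plausible reason a brute-force approach stays comfortable only up to a moderate dataset size.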

davidberenstein1957

posted an update 2 months ago

Post

1747

⚡️ LLMs do a good job at NER, but don't you want to learn how to do more with less?

Go from 🐢 -> 🐇

If you want a small model to perform well on your problem, you need to fine-tune it.

Bootstrap with a teacher model.

Correct potential mistakes to get high-quality data.

Fine-tune your student model.

Get more accurate and more efficient.

Free signup: https://lu.ma/zx2t7irs

davidberenstein1957

posted an update 3 months ago

Post

1694

You can now build a custom text classifier without days of human labeling!

👍 LLMs work reasonably well as text classifiers.
👎 They are expensive to run at scale and their performance drops in specialized domains.

👍 Purpose-built classifiers have low latency and can potentially run on CPU.
👎 They require labeled training data.

Combine the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts to train and deploy a specialized model.

Blog: https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
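
One way to picture combining LLM labeling with human feedback is a simple routing step: keep confident LLM suggestions, send the rest to a reviewer, then merge the corrected labels into a training set. The sketch below is an illustration under assumptions, not the workflow from the blog: the `Suggestion` class, the confidence score, the 0.8 threshold and the example texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    text: str
    label: str
    confidence: float  # hypothetical score in [0, 1] reported by the LLM labeler


def route(suggestions, threshold=0.8):
    """Split suggestions into auto-accepted labels and a human review queue."""
    accepted = [s for s in suggestions if s.confidence >= threshold]
    review = [s for s in suggestions if s.confidence < threshold]
    return accepted, review


def apply_corrections(review, corrections):
    """Replace suggested labels with human corrections, keyed by text.

    Items the reviewer did not change keep the suggested label.
    """
    return [Suggestion(s.text, corrections.get(s.text, s.label), 1.0) for s in review]


suggestions = [
    Suggestion("Refund still not received", "billing", 0.95),
    Suggestion("App crashes on login", "billing", 0.55),
    Suggestion("How do I change my plan?", "account", 0.70),
]
accepted, review = route(suggestions)
corrected = apply_corrections(review, {"App crashes on login": "bug"})
training_set = accepted + corrected
print([(s.text, s.label) for s in training_set])
```

The resulting `training_set` is what a small purpose-built classifier would then be fine-tuned on; the threshold controls how much human effort is spent.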

davidberenstein1957

posted an update 3 months ago

Post

684

The Synthetic Data Generator now directly integrates with Argilla, so you can generate and curate your own high-quality datasets from pure natural language!

Up next -> include dataset generation for text classification.
Other suggestions? Let us know.

Space: argilla/synthetic-data-generator

davidberenstein1957

posted an update 3 months ago

Post

2501

Don't use an LLM when you can use a much cheaper model.

The problem is that no one tells you how to actually do it.

Just picking a pre-trained model (e.g., BERT) and throwing it at your problem won't work!

If you want a small model to perform well on your problem, you need to fine-tune it.

And to fine-tune it, you need data.

The good news is that you don't need a lot of data but instead high-quality data for your specific problem.

In the latest livestream, I showed you guys how to get started with Argilla on the Hub! Hope to see you at the next one.

https://www.youtube.com/watch?v=BEe7shiG3rY

davidberenstein1957

posted an update 3 months ago

Post

1216

On Thursday 10 October at 17:00 CEST, I will show a good way to get started with a text classification project on the Hugging Face Hub with Argilla and SetFit.

Sign up here: https://lu.ma/31mecp34

davidberenstein1957

posted an update 3 months ago

Post

1136

Why is argilla/FinePersonas-v0.1 great for synthetic data generation? It can be used to synthesise realistic and diverse data of the customer personas your company is interested in!

Dataset: argilla/FinePersonas-v0.1
Example usage: https://distilabel.argilla.io/dev/sections/pipeline_samples/examples/fine_personas_social_network/
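
For a feel of how personas can drive diverse synthetic data, here is a minimal sketch that fills a prompt template with sampled persona descriptions. The persona strings, the template wording and the `build_prompts` helper are made up for this example; they are not rows from the dataset, and this is not the distilabel pipeline from the linked example.

```python
import random

# Hypothetical persona descriptions; the real dataset has its own rows and columns.
personas = [
    "A pediatric nurse who tracks vaccination schedules for a rural clinic.",
    "A freelance illustrator who sells prints through a small online shop.",
    "A logistics manager who plans routes for a regional delivery fleet.",
]

TEMPLATE = (
    "You are the following person: {persona}\n"
    "Write one short customer review of a {product} from this person's point of view."
)


def build_prompts(personas, product, n, seed=0):
    """Sample up to n distinct personas (fixed seed) and fill the prompt template."""
    rng = random.Random(seed)
    chosen = rng.sample(personas, k=min(n, len(personas)))
    return [TEMPLATE.format(persona=p, product=product) for p in chosen]


prompts = build_prompts(personas, product="note-taking app", n=2)
for prompt in prompts:
    print(prompt, end="\n\n")
```

Each prompt would then be sent to an LLM of your choice; varying the persona is what pushes the generated reviews apart in tone and content.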


Dataset Viber

AI & ML interests

Recent Activity

dataset-viber's activity

Team members 1
