How do your annotations for FineWeb2 compare to your teammates'?
I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.
I did some analysis and I wasn't surprised to see that I'm a bit harsher in my evaluations than my teammates 😂
Do you want to see how your annotations compare to others? 👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations ✍️ Enter the dataset that you've contributed to and your Hugging Face username.
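If you'd rather poke at the numbers yourself, here's a minimal sketch of the same idea with the `datasets` library. The dataset id and the `annotator`/`label` column names are placeholders, not the sprint's actual schema, so adjust them to the dataset you contributed to:

```python
# Rough sketch (not the Space's actual code): compare your label distribution
# with the rest of the team. Dataset id and column names are hypothetical.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("your-org/fineweb2-annotations-yourlang", split="train")  # placeholder id

my_username = "your-hf-username"
mine = Counter(row["label"] for row in ds if row["annotator"] == my_username)
others = Counter(row["label"] for row in ds if row["annotator"] != my_username)

for label in sorted(set(mine) | set(others)):
    my_share = mine[label] / max(sum(mine.values()), 1)
    team_share = others[label] / max(sum(others.values()), 1)
    print(f"{label}: me {my_share:.1%} vs team {team_share:.1%}")
```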
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏
At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.
Follow the link below, check if your language is listed and sign up to be a Language Lead!
distilabel 1.3.0 is out! This release contains many core improvements and new tasks that helped us build argilla/magpie-ultra-v0.1!
Distributed pipeline execution with Ray, new Magpie tasks, reward models, components for dataset diversity based on sentence embeddings, Argilla 2.0 compatibility and many more features!
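If you want to try the new Magpie support, here's a rough sketch of what a small pipeline can look like. It's based on my reading of the 1.3 docs, not the magpie-ultra recipe itself, and parameter names such as `magpie_pre_query_template` or `num_rows` may differ slightly in your version:

```python
# Sketch of a Magpie-style instruction generation pipeline with distilabel 1.3.
# Parameter names follow the docs as I recall them; treat them as assumptions.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MagpieGenerator

with Pipeline(name="magpie-sketch") as pipeline:
    MagpieGenerator(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            magpie_pre_query_template="llama3",  # enables the Magpie pre-query prompt
        ),
        n_turns=1,     # single-turn instruction + response pairs
        num_rows=100,  # small sample for a quick test
    )

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("your-username/magpie-sketch")  # hypothetical repo id
```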
Hi everyone! I'm Alex, I'm 16, and I've been doing an internship at Hugging Face for a little over a week. I've already learned a lot about using and prompting LLMs. With @victor as my tutor, I've just finished a Space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize Hugging Face posts.
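For anyone curious about the general idea (this is a tiny sketch, not Alex's actual Space): you can ask a chat model for a one-word sentiment label through the Hugging Face Inference API. The model id here is just an example, swap in any chat model you have access to:

```python
# Minimal sketch: sentiment classification by prompting a chat model.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3.1-8B-Instruct")  # example model id

def classify_feeling(text: str) -> str:
    """Ask the chat model to label the sentiment of a piece of text."""
    response = client.chat_completion(
        messages=[
            {"role": "system", "content": "Reply with one word: positive, negative or neutral."},
            {"role": "user", "content": text},
        ],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_feeling("I've already learned a lot during my internship!"))
```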
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!
We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.
Over the past year, we’ve been collaborating with Hugging Face on countless projects: becoming a launch partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.
After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.
To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.
As a founder, I am proud of the Argilla team. We're now part of something bigger, joining a larger team with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.
Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.
Would love to answer any questions you have so feel free to add them below!
🌐 Overview: The YouTube Commons Dataset is a comprehensive collection of 30 billion words from 15,112,121 original and automatically translated transcripts, drawn from 2,063,066 videos on YouTube.
🔗 License: All videos are shared under the CC-BY license, with the majority (71%) in English.
🤖 Applications: This dataset is ideal for training speech-to-text (ASR) and translation models.
📊 Utilization: The text can be used for model training and is republishable for reproducibility purposes.
🤝 Collaboration: This dataset is the result of a collaboration between state start-up LANGU:IA, the French Ministry of Culture, and DINUM. It will be expanded in the coming months.
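If you want to peek at the data without downloading 30 billion words, a streaming sketch like the one below works with the `datasets` library. The repo id `PleIAs/YouTube-Commons` and the field names are my assumptions, so double-check the dataset card:

```python
# Sketch: stream a single record from the corpus to inspect its fields.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("PleIAs/YouTube-Commons", split="train", streaming=True)  # repo id is an assumption

for row in islice(ds, 1):
    print(list(row.keys()))  # inspect the transcript/metadata fields
    print(str(row)[:300])    # preview one record
```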
🔥 Community and Data Quality Matter More for Alignment
A recipe to replicate SPIN (Self-Play Fine-Tuning) with 30x less data:
🗣️ 50K samples vs 1.8K prompts curated by the 350+ amazing DIBT contributors.
⚗️ Distillation of Mistral Large instead of OpenAI.
🙌 Open data & code with ⚗️ distilabel.
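As a rough illustration of the data-curation side (not the exact released code), this sketch keeps the highly rated, multi-annotator prompts from the DIBT 10k_prompts_ranked dataset; the column names and thresholds are my assumptions from the dataset card:

```python
# Sketch: curate a small, high-quality prompt subset from the DIBT dataset.
from datasets import load_dataset

prompts = load_dataset("DIBT/10k_prompts_ranked", split="train")
print(prompts.column_names)  # check the rating fields before filtering

# Hypothetical filter: keep highly rated prompts with more than one annotation.
curated = prompts.filter(
    lambda row: (row.get("avg_rating") or 0) >= 4 and (row.get("num_responses") or 0) > 1
)
print(f"Kept {len(curated)} of {len(prompts)} prompts")
```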