Blindly applying algorithms without understanding the math behind them is not a good idea. So, I am on a quest to fix this!
I wrote my first Hugging Face article on how to derive closed-form solutions for KL-regularised reinforcement learning problems, the result that DPO is built on.
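The central result the article works toward is standard and worth stating compactly: for the KL-regularised objective, the optimal policy has a closed form in terms of the reference policy and the reward, with Z(x) the per-prompt partition function:

```latex
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot\mid x)}\bigl[r(x,y)\bigr]
  \;-\; \beta\,\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr)
\;\Longrightarrow\;
\pi^{*}(y\mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,
  \exp\!\Bigl(\tfrac{1}{\beta}\,r(x,y)\Bigr)
```

DPO then inverts this identity to express the reward in terms of π* and π_ref, which is what removes the need for an explicit reward model.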
🔍 From instruction-following to creative storytelling, dive into 2024's most impactful AI datasets! These gems are shaping everything from scientific research to video understanding.
🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
🏷️ 200+ contributors used Argilla to annotate MMLU questions, flagging those where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions turned out to require Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. 🗽 Culturally Agnostic: no specific regional or cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.
Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and to professional annotators leveraging Argilla on the Hub.
I hope this will lead to a better understanding of the limitations and challenges of making open AI useful for many languages.
Hugging Face presents FineVideo 🎥! Unlocking the next generation of video understanding 🚀
🤯 3,400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood and content descriptions per scene, as well as QA pairs. 🔥 @mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.
The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into lists of structs
- Removing empty system prompts
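The three cleaning steps above can be sketched in plain Python; the row schema here (a `"messages"` field holding a JSON string of role/content dicts) is an assumption for illustration, not the dataset's actual layout:

```python
import json

def clean_rows(splits):
    """Sketch of the cleaning steps over a {split_name: [row, ...]} mapping."""
    rows = []
    for split_name, split_rows in splits.items():
        for row in split_rows:
            row = dict(row, split=split_name)  # join splits, add a split column
            if isinstance(row.get("messages"), str):
                # convert string messages into a list of structs
                row["messages"] = json.loads(row["messages"])
            # remove empty system prompts
            row["messages"] = [
                m for m in row["messages"]
                if not (m["role"] == "system" and not m["content"].strip())
            ]
            rows.append(row)
    return rows
```

The same logic would typically run as a `map`/`filter` pass in a dataframe or `datasets`-style pipeline rather than a Python loop.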
I wanted to introduce myself and my company @Overlaiapp. We are a collective of filmmakers, photographers, and AI engineers working on high resolution (8K+) training data.
We plan to share a lot of our datasets with the community and are kicking things off with two curated datasets:
🎥 Oversampled: Every clip is captured in stunning 8K resolution, delivering rich detail ideal for fine-tuning on scenic landscapes and ocean dynamics.
📸 Variance: Includes close-up details, slow-motion footage of crashing waves, sweeping landscapes, and wildlife shots.
📋 Detailed Metadata: Every clip is paired with structured metadata, including creative descriptions, precise camera movements, lens information, field of view calculations, and shot settings, ensuring AI models can fully understand and replicate real-world cinematography with accuracy.
⚙️ Consistency: Re-thinking training data at the point of capture by "overshooting" a subject, enabling models to learn more nuanced relationships and views across scenes.
🌅 Light: Shot during early morning and sunset light for optimal color contrast and dynamic range, maximizing visual quality for color and lighting-sensitive tasks.
🔍 Curation: Curated specifically for machine learning, providing clean, high-quality data for next generation model training.
Microsoft researchers dropped a groundbreaking technique that could slash the energy use of transformer computations: their novel "linear-complexity multiplication" (L-Mul) algorithm approximates floating-point multiplication using energy-efficient integer addition instead of costly multiplications.
💡 Quick reminder on how floats are coded on 8 bits (FP8). In the e4m3 FP8 standard, you encode a number as: Sign (1 bit) | Exponent (4 bits) | Mantissa (3 bits). Example: 0 (positive) | 1000 (8) | 101 (1/2 + 1/8 = 0.625). Calculation: you add one to the mantissa, and multiply it by 2 to the power of (the exponent minus a bias term, which is 7 for e4m3):
➡️ You get (1 + 0.625) × 2^(8-7) = 3.25
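The decoding rule above fits in a few lines; this sketch handles normal e4m3 values only (it ignores subnormals and NaN encodings):

```python
def decode_e4m3(byte: int) -> float:
    """Decode a normal FP8 e4m3 value: sign | 4-bit exponent | 3-bit mantissa."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 3) & 0b1111   # 4 exponent bits
    mantissa = byte & 0b111           # 3 mantissa bits
    bias = 7                          # e4m3 exponent bias
    return sign * (1 + mantissa / 8) * 2 ** (exponent - bias)

# 0 | 1000 | 101  ->  (1 + 0.625) * 2^(8-7) = 3.25
print(decode_e4m3(0b01000101))  # 3.25
```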
Now back to the paper. 𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
⚡️ Multiplication is extremely energy-intensive compared to addition. For 32-bit operations, multiplication (3.7 pJ) uses 37x more energy than addition (0.1 pJ)!
🧮 Traditional floating-point multiplication goes like this (writing xm for the mantissa and xe for the exponent of x): Mul(x,y) = (1 + xm) · 2^xe · (1 + ym) · 2^ye = (1 + xm + ym + xm · ym) · 2^(xe+ye)
💡 L-Mul cleverly approximates this as: L-Mul(x,y) = (1 + xm + ym + 2^-l(m)) · 2^(xe+ye), eliminating the costly xm · ym term
🔧 l(m) term is adaptively set based on mantissa size for optimal accuracy
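Numerically, the approximation is easy to see side by side. This sketch fixes l(m) = 3 for illustration (the paper sets it adaptively based on mantissa size):

```python
def fp_mul(xm, xe, ym, ye):
    """Exact product of two floats given as (mantissa fraction, unbiased exponent)."""
    return (1 + xm + ym + xm * ym) * 2 ** (xe + ye)

def l_mul(xm, xe, ym, ye, l_m=3):
    """L-Mul: replace the costly xm*ym cross term with a constant 2^-l(m)."""
    return (1 + xm + ym + 2 ** -l_m) * 2 ** (xe + ye)

# x = 3.25 (xm=0.625, xe=1), y = 1.5 (ym=0.5, ye=0)
print(fp_mul(0.625, 1, 0.5, 0))  # 4.875 (exact)
print(l_mul(0.625, 1, 0.5, 0))   # 4.5   (approximation)
```

The hardware win is that the remaining operations are additions on exponents and mantissas; only the final scaling by a power of two is left, which is a shift.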
📊 Benchmarks on the Llama-3.1-8B-Instruct model show L-Mul preserves precision across various NLP tasks, with performance nearly identical to full BFloat16 precision
💬 Authors claim: "We can achieve the same model inference performance while reducing the energy cost of attention computations by 80%."
This breakthrough is still theoretical and would need implementation on dedicated hardware to confirm real-world gains, but it’s a really exciting path for more sustainable AI! 🌱
🔗 Evaluating Long Context #1: Long Range Arena (LRA)
Accurately evaluating how well language models handle long contexts is crucial, but it's also quite challenging to do well. In this series of posts, we're going to examine the various benchmarks that have been proposed to assess long-context understanding, starting with Long Range Arena (LRA).
Introduced in 2020, Long Range Arena (LRA) is one of the earliest benchmarks designed to tackle the challenge of long-context evaluation.
📌 Key Features of LRA
1️⃣ Diverse Tasks: The LRA benchmark consists of a suite of tasks designed to evaluate model performance on long sequences ranging from 1,000 to 16,000 tokens. These tasks encompass different data types and modalities: Text, Natural and Synthetic Images, and Mathematical Expressions.
2️⃣ Synthetic and Real-world Tasks: LRA comprises both synthetic probing tasks and real-world tasks.
3️⃣ Open-Source and Extensible: Implemented in Python using Jax and Flax, the LRA benchmark code is publicly available, making it easy to extend.
📌 Tasks
1️⃣ Long ListOps
2️⃣ Byte-level Text Classification and Document Retrieval
3️⃣ Image Classification
4️⃣ Pathfinder and Pathfinder-X (Long-range spatial dependency)
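To make the task format concrete, Long ListOps asks a model to evaluate deeply nested list operations (MAX, MIN, MED for median, SM for sum modulo 10) written out as a long token sequence. A tiny reference evaluator, with whitespace splitting standing in for the benchmark's tokenization:

```python
def eval_listops(tokens):
    """Evaluate a tokenized ListOps expression, e.g. '[MAX 4 [MIN 2 3 ] ]'.split()."""
    stack = []
    for tok in tokens:
        if tok.startswith('['):
            stack.append(tok[1:])            # push the operator name
        elif tok == ']':
            args = []
            while isinstance(stack[-1], int):
                args.append(stack.pop())     # pop this operator's arguments
            op = stack.pop()
            if op == 'MAX':   val = max(args)
            elif op == 'MIN': val = min(args)
            elif op == 'MED': val = sorted(args)[len(args) // 2]
            elif op == 'SM':  val = sum(args) % 10
            stack.append(val)
        else:
            stack.append(int(tok))
    return stack[0]

print(eval_listops('[MAX 4 3 [MIN 2 3 ] 1 0 ]'.split()))  # 4
```

The difficulty for long-context models is that the answer can hinge on tokens thousands of positions apart in the flattened expression.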
We're thrilled to announce the release of Argilla 2.2.0, packed with powerful new features to enhance your data annotation and LLM workflow:
- 🗨️ ChatField: Work with text conversations natively in Argilla. Perfect for building datasets for conversational LLMs!
- ⚙️ Adjustable Task Distribution: Modify settings on the fly and automatically recalculate completed and pending records.
- 📊 Progress Tracking: Monitor annotation progress directly from the SDK, including user-specific metrics.
- 🧠 Automatic Settings Inference: Importing datasets from the Hugging Face Hub just got easier with automatic settings detection.
- 📋 Task Templates: Jump-start your projects with pre-built templates for common dataset types.
- 🔧 Background Jobs Support: Improved performance for long-running tasks (requires Redis).
It is an LLM-controlled rogue-like in which the LLM receives a markdown representation of the map and should generate JSON with the objective to fulfill on the map, as well as the necessary objects and their placements.
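To make the exchange concrete, here is a hypothetical round trip; the map glyphs and the JSON schema below are illustrative, not the project's actual format:

```python
import json

# The LLM receives a markdown map like this ('@' player, '#' wall, '>' stairs)...
map_md = """\
|.|.|#|
|@|.|#|
|.|.|>|
"""

# ...and should answer with JSON describing the objective and object placements:
response = json.loads("""{
  "objective": "reach the stairs",
  "objects": [
    {"type": "key", "x": 1, "y": 0},
    {"type": "monster", "x": 2, "y": 2}
  ]
}""")
print(response["objective"])  # reach the stairs
```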
Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚
This dataset is tailored for NLP tasks and helps streamline model training by offering a more refined, unique dataset. Perfect for startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the large Fineweb dataset and includes:
- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- A count column indicating duplication frequency
- Data from 95 Common Crawl crawls (2013-2024)
- Rows reduced from 1.279B to 0.324B after deduplication
- ~375B tokens (down from 1,320B in Fineweb-Edu)
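The exact-match deduplication with a count column can be sketched in a few lines (assuming rows with a "text" field; the real pipeline of course ran at Common Crawl scale, not over in-memory lists):

```python
from collections import Counter

def dedup_with_counts(rows, key=lambda r: r["text"]):
    """Keep one copy per exact text match, recording how often it appeared."""
    counts = Counter(key(r) for r in rows)   # duplication frequency per text
    seen, out = set(), []
    for r in rows:
        k = key(r)
        if k not in seen:                    # first occurrence wins
            seen.add(k)
            out.append(dict(r, count=counts[k]))
    return out
```

The count column is what lets downstream users re-weight frequent documents instead of losing that signal entirely to deduplication.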
Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗