Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face ๐Ÿค— LLMs, Agents, RAG, Multimodal.

Recent Activity

View all activity

Articles

Organizations

m-ric's activity

posted an update 3 days ago
view post
Post
1033
Made a new app to visualize the LLM race โ‡’ ๐—ก๐—ผ ๐—˜๐˜‚๐—ฟ๐—ผ๐—ฝ๐—ฒ๐—ฎ๐—ป ๐—ฐ๐—ผ๐—บ๐—ฝ๐—ฎ๐—ป๐˜† ๐—ถ๐—ป ๐˜๐—ต๐—ฒ ๐˜๐—ผ๐—ฝ ๐Ÿญ๐Ÿฌ ๐Ÿ‡ช๐Ÿ‡บโŒ

See the app here ๐Ÿ‘‰ m-ric/llm-race-to-the-top

I've adapted an app by @andrewrreed that tracks progress of LLMs ( andrewrreed/closed-vs-open-arena-elo), on the Chatbot Arena leaderboard, to compare companies from different countries.

The outcome is quite sad, as a Frenchman and European.

The top 10 is exclusively US ๐Ÿ‡บ๐Ÿ‡ธ and Chinese ๐Ÿ‡จ๐Ÿ‡ณ companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI ๐Ÿ‡ซ๐Ÿ‡ท.

American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up ๐Ÿ˜ฌ

โš ๏ธ Caution: This Chatbot Arena ELO ranking is not the most accurate, especially at high scores like this, because LLM makers can game it to some extent.
posted an update 4 days ago
view post
Post
753
๐—ก๐—ฒ๐˜„ ๐—น๐—ฒ๐—ฎ๐—ฑ๐—ฒ๐—ฟ๐—ฏ๐—ผ๐—ฎ๐—ฟ๐—ฑ ๐—ฟ๐—ฎ๐—ป๐—ธ๐˜€ ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ณ๐—ผ๐—ฟ ๐—Ÿ๐—Ÿ๐— -๐—ฎ๐˜€-๐—ฎ-๐—ท๐˜‚๐—ฑ๐—ด๐—ฒ: ๐—Ÿ๐—น๐—ฎ๐—บ๐—ฎ-๐Ÿฏ.๐Ÿญ-๐Ÿณ๐Ÿฌ๐—• ๐˜๐—ผ๐—ฝ๐˜€ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฎ๐—ป๐—ธ๐—ถ๐—ป๐—ด๐˜€! ๐Ÿง‘โ€โš–๏ธ

Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.

First, what is "LLM-as-a-judge"?
๐Ÿ‘‰ It's a very useful technique for evaluating LLM outputs. If anything you're evaluating cannot be properly evaluated with deterministic criteria, like the "politeness" of an LLM output, or how faithful it is to an original source, you can use LLM-judge instead : prompt another LLM with "Here's an LLM output, please rate this on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilร , you get your score.

๐Ÿง But who judges the judge?
How can you make sure your LLM-judge is reliable? You can have a specific dataset annotated with scores provided by human judges, and compare how LLM-judge scores correlate with human judge scores.

๐Ÿ“Š Before even running that benchmark, to get you started, there's a new option to get you started: a leaderboard that measures how well different model perform as judges!

And the outcome is surprising, models come in quite different orders from what we're used to in general rankings: probably some have much better bias mitigation than others!

Take a deeper look here ๐Ÿ‘‰ https://huggingface.co/blog/arena-atla
posted an update 4 days ago
view post
Post
238
Lifehack of the day:
Adding "r.jina.ai/" before any url transforms it in Markdown using Jina AI's Reader! Here with @cyrilzakka 's blog post.
Reacted to cfahlgren1's post with โค๏ธ 5 days ago
view post
Post
2866
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
  • 1 reply
ยท
posted an update 5 days ago
view post
Post
763
๐Ÿ” Meta teams use a fine-tuned Llama model to fix production issues in seconds

One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.

๐Ÿค” 42%, is that not too low?
โžก๏ธ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
๐Ÿ’ก So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!

How did they do it?

๐Ÿ”„ Two-step approach:
โ€ฃ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
โ€ฃ Fine-tuned Llama 2 7B ranks the most likely culprits

๐ŸŽ“ Training pipeline:
โ€ฃ Continued pre-training on Meta's internal docs and wikis
โ€ฃ Supervised fine-tuning on past incident investigations
โ€ฃ Training data mimicked real-world constraints (2-20 potential changes per incident)

๐Ÿ”ฎ Now future developments await:
โ€ฃ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
โ€ฃ Improvements in model reasoning should boost accuracy further

Read it in full ๐Ÿ‘‰ https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
Reacted to reach-vb's post with ๐Ÿ”ฅ 6 days ago
view post
Post
4017
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B/ 32B (Base + Instruct) Code generation LLMs, with 32B tackling giants like Gemnini 1.5 Pro, Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow
deepseek-ai/JanusFlow-1.3B

Common Corpus by Pleais - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! ๐Ÿค—
posted an update 6 days ago
view post
Post
1338
Great feature alert: ๐—ฌ๐—ผ๐˜‚ ๐—ฐ๐—ฎ๐—ป ๐—ป๐—ผ๐˜„ ๐˜‚๐˜€๐—ฒ ๐—ฎ๐—ป๐˜† ๐—ฆ๐—ฝ๐—ฎ๐—ฐ๐—ฒ ๐—ฎ๐˜€ ๐—ฎ ๐˜๐—ผ๐—ผ๐—น ๐—ณ๐—ผ๐—ฟ ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜๐—ฟ๐—ฎ๐—ป๐˜€๐—ณ๐—ผ๐—ฟ๐—บ๐—ฒ๐—ฟ๐˜€.๐—ฎ๐—ด๐—ฒ๐—ป๐˜! ๐Ÿ› ๏ธ๐Ÿ”ฅ๐Ÿ”ฅ

This lets you take the coolest spaces, like FLUX.1-dev, and use them in agentic workflows with a few lines of code! ๐Ÿง‘โ€๐Ÿ’ป

On the video below, I set up my fake vacation pictures where I'm awesome at surfing (I'm really not) ๐Ÿ„

Head to the doc to learn this magic ๐Ÿ‘‰ https://huggingface.co/docs/transformers/main/en/agents_advanced#import-a-space-as-a-tool-
posted an update 10 days ago
view post
Post
368
๐— ๐—ฒ๐˜๐—ฎ ๐˜๐—ฒ๐—ฎ๐—บ ๐—ท๐˜‚๐˜€๐˜ ๐—ฑ๐—ฟ๐—ผ๐—ฝ๐—ฝ๐—ฒ๐—ฑ ๐˜๐—ต๐—ฒ ๐—ณ๐—ถ๐—ฟ๐˜€๐˜ ๐—ช๐—ฎ๐˜๐—ฒ๐—ฟ๐—บ๐—ฎ๐—ฟ๐—ธ๐—ถ๐—ป๐—ด ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น ๐˜๐—ต๐—ฎ๐˜ ๐—ป๐—ผ๐˜ ๐—ฒ๐—ฑ๐—ถ๐˜ ๐—ฐ๐—ฎ๐—ป ๐—ฏ๐—ฟ๐—ฒ๐—ฎ๐—ธ!๐Ÿ›ก๏ธ

๐Ÿค” Ever heard of watermarking? It's a technique that allows you to mark in an image its original source. It's our best shield against AI-generated deepfakes, or content stolen from artists! ๐ŸŽจ

๐ŸŽญ Watermarking systems are actually a pair of models: a watermark embedder that applies the watermark on the image, and its corresponding decoder that should detect the original watermark.

โ›” But current methods were very limited: they can only apply and detect the watermark on your image as a whole. So, if you're an attacker it's easy to break: just crop it! add text on top! or whatever, really, anything would work to break the watermark.

A team of researchers at Meta was not happy with this. ๐Ÿ˜ค

So to withstand real-world attacks, they decided to make a watermarking model that would also work on any sub-part of the image. It's a real paradigm shift: they consider watermarking not as an image classification task, but as an image segmentation task!

๐Ÿ—๏ธ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ
โ–ธ The "Embedder" (a variational autoencoder + embedder, 1.1M parameters in total) encodes a n-bit message into a watermark signal that is added to the original image
โ–ธ [Only during training] The "Augmenter" randomly distorts the image: masks parts, crops, resizes, compresses. It's basically torture at this point.
โ–ธ The "Extractor" (a vision transformer, or ViT, with 96M parameters) then re-extracts the message from the distorted image, by predicting a (1+n) vector per pixel to predict the watermarked parts and decode corresponding messages.

The performance blows existing models out of the water, they even created new tasks (segmentation-related) just to grok them!

Gerat work @pierrefdz and @tomsander1998 !

Paper here ๐Ÿ‘‰ Watermark Anything with Localized Messages (2411.07231)
Reacted to maxiw's post with โค๏ธ๐Ÿš€๐Ÿ”ฅ 11 days ago
view post
Post
4550
I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
โค๏ธ: 7048 times
๐Ÿ”ฅ: 5921 times
๐Ÿ‘: 4856 times
๐Ÿš€: 2549 times
๐Ÿค—: 2065 times
ยท
posted an update 11 days ago
view post
Post
3667
๐—ง๐—ต๐—ฒ ๐—ป๐—ฒ๐˜…๐˜ ๐—ฏ๐—ถ๐—ด ๐˜€๐—ผ๐—ฐ๐—ถ๐—ฎ๐—น ๐—ป๐—ฒ๐˜๐˜„๐—ผ๐—ฟ๐—ธ ๐—ถ๐˜€ ๐—ป๐—ผ๐˜ ๐Ÿฆ‹, ๐—ถ๐˜'๐˜€ ๐—›๐˜‚๐—ฏ ๐—ฃ๐—ผ๐˜€๐˜๐˜€! [INSERT STONKS MEME WITH LASER EYES]

See below: I got 105k impressions since regularly posting Hub Posts, coming close to my 275k on Twitter!

โš™๏ธ Computed with the great dataset maxiw/hf-posts
โš™๏ธ Thanks to Qwen2.5-Coder-32B for showing me how to access dict attributes in a SQL request!

cc @merve who's far in front of me
ยท
posted an update 13 days ago
view post
Post
2344
A non-Instruct LLM assistant is mostly useless. ๐Ÿง

Since it's mostly a model trained to complete text, when you ask it a question like "What to do during a stopover in Paris?", it can just go on and on adding more details to your question instead of answering, which would be valid to complete text from its training corpus, but not to answer questions.

โžก๏ธ So the post-training stage includes an important Instruction tuning step where you teach your model how to be useful : answer questions, be concise, be polite... RLHF is a well known technique for this.

For people interested to understand how this step works, the folks at Adaptive ML have made a great guide!

Read it here ๐Ÿ‘‰ https://www.adaptive-ml.com/post/from-zero-to-ppo
posted an update 14 days ago
view post
Post
3151
๐—ค๐˜„๐—ฒ๐—ป๐Ÿฎ.๐Ÿฑ-๐—–๐—ผ๐—ฑ๐—ฒ๐—ฟ-๐Ÿฏ๐Ÿฎ๐—•: ๐—ป๐—ฒ๐˜„ ๐—ฏ๐—ฒ๐˜€๐˜-๐—ถ๐—ป-๐—ฐ๐—น๐—ฎ๐˜€๐˜€ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น, ๐—ฏ๐—ฒ๐—ฎ๐˜๐˜€ ๐—š๐—ฃ๐—ง-๐Ÿฐ๐—ผ ๐—ผ๐—ป ๐—บ๐—ผ๐˜€๐˜ ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€!๐Ÿ’ฅ

๐Ÿ’ช It's the first time Open-Source coding model of this size class that clearly matches GPT-4o's coding capabilities!

โœจ Completes the previous two Qwen 2.5 Coder release with 4 new size: 0.5B, 3B, 14B, 32B
๐Ÿ“š Support long context up to 128K (for the 14B and 32B models)
โœ… Drop-in replacement to GPT-4o as a coding assistant on Cursor or for Artifacts!
๐Ÿค— Models available right now on the Hub, under Apache 2.0 license!

They have setup a crazy Artifacts demo, you should go have a look!
๐Ÿ‘‰ Qwen/Qwen2.5-Coder-Artifacts
posted an update 14 days ago
view post
Post
777
๐—”๐—ฟ๐—ฒ ๐˜€๐—ฐ๐—ฎ๐—น๐—ถ๐—ป๐—ด ๐—น๐—ฎ๐˜„๐˜€ ๐—ผ๐˜ƒ๐—ฒ๐—ฟ? ๐—” ๐—ฟ๐—ฒ๐—ฝ๐—ผ๐—ฟ๐˜ ๐—ณ๐—ฟ๐—ผ๐—บ ๐˜๐—ต๐—ฒ ๐—œ๐—ป๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฎ๐—ป๐—ป๐—ผ๐˜‚๐—ป๐—ฐ๐—ฒ๐—ฑ ๐˜๐—ต๐—ฎ๐˜ ๐—ข๐—ฝ๐—ฒ๐—ป๐—”๐—œ ๐—ถ๐˜€ ๐˜€๐—ฒ๐—ฒ๐—ถ๐—ป๐—ด ๐—ฑ๐—ถ๐—บ๐—ถ๐—ป๐—ถ๐˜€๐—ต๐—ถ๐—ป๐—ด ๐—ฟ๐—ฒ๐˜๐˜‚๐—ฟ๐—ป๐˜€ ๐—ณ๐—ฟ๐—ผ๐—บ ๐˜€๐—ฐ๐—ฎ๐—น๐—ถ๐—ป๐—ด ๐˜‚๐—ฝ ๐˜๐—ต๐—ฒ ๐—ป๐—ฒ๐˜…๐˜ ๐—š๐—ฃ๐—ง ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€.

๐Ÿ“Š What are scaling laws? These are empiric laws that say "Every time you increase compute spent in training 10-fold, your LLM's performance will go up by a predictable tick". Of course, they apply only if you train your model with the right methods.

The image below illustrates it: they're from a paper by Google, "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", and they show how quality and instruction following of models improve when you scale the model up (which is equivalent to scaling up the compute spent in training).

โžก๏ธ These scaling laws have immense impact: they triggered the largest gold rush ever, with companies pouring billions into scaling up theiur training. Microsoft and OpenAI spent 100B into their "Startgate" mega training cluster, due to start running in 2028.

๐Ÿค” So, what about these reports of scaling laws slowing down?

If they are true, they would mean a gigantic paradigm shift, as the hundreds of billions poured by AI companies into scaling could be a dead-end. โ›”๏ธ

But I doubt it: until the most recent publications, scaling laws showed no signs of weakness, and the researchers at the higher end of the scale-up seems to imply the scaling up continues.

Wait and see!
  • 1 reply
ยท
posted an update 17 days ago
view post
Post
1618
๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ๐—Ÿ๐—ฎ๐—ฏ: ๐—™๐—ถ๐—ฟ๐˜€๐˜ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ๐—ฎ๐˜๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐—ณ๐—ผ๐—ฟ ๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ ๐—บ๐—ผ๐—ฏ๐—ถ๐—น๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐˜€๐—บ๐—ฎ๐—น๐—น, ๐—ณ๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ฒ๐—ฑ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—ฐ๐—ฎ๐—ป ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ ๐—ฎ ๐—๐—”๐—ฅ๐—ฉ๐—œ๐—ฆ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ ๐—ผ๐—ป ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜€๐—บ๐—ฎ๐—ฟ๐˜๐—ฝ๐—ต๐—ผ๐—ป๐—ฒ ๐Ÿ“ฑ๐Ÿ”ฅ

A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.

They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.

The team built:

๐Ÿ“Šย A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically

๐Ÿ“๐Ÿ“ฑย A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces

โœ…ย An instruction dataset of 10.5k operation traces for training mobile agents

Key insights:

- ๐Ÿ“ˆ Fine-tuning improves performance BY A LOT: Open-source model Llama-3.1-8B improves from 2% to 24% success rate after training, nearly reaching GPT-4o performance although itโ€™s much smaller
- โš™๏ธ Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.

Read their paper here ๐Ÿ‘‰ AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents (2410.24024)
posted an update 20 days ago
view post
Post
2488
๐—›๐˜‚๐—ป๐˜†๐˜‚๐—ฎ๐—ป-๐—Ÿ๐—ฎ๐—ฟ๐—ด๐—ฒ ๐—ท๐˜‚๐˜€๐˜ ๐—ฟ๐—ฒ๐—น๐—ฒ๐—ฎ๐˜€๐—ฒ๐—ฑ ๐—ฏ๐˜† ๐—ง๐—ฒ๐—ป๐—ฐ๐—ฒ๐—ป๐˜: ๐—Ÿ๐—ฎ๐—ฟ๐—ด๐—ฒ๐˜€๐˜ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐— ๐—ผ๐—˜ ๐—Ÿ๐—Ÿ๐— , ๐—ผ๐—ป๐—น๐˜† ๐Ÿฑ๐Ÿฎ๐—• ๐—ฎ๐—ฐ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ๐˜€ ๐—ฏ๐˜‚๐˜ ๐—ฏ๐—ฒ๐—ฎ๐˜๐˜€ ๐—Ÿ๐—Ÿ๐—ฎ๐— ๐—” ๐Ÿฏ.๐Ÿญ-๐Ÿฐ๐Ÿฌ๐Ÿฑ๐—• ๐—ผ๐—ป ๐—บ๐—ผ๐˜€๐˜ ๐—ฎ๐—ฐ๐—ฎ๐—ฑ๐—ฒ๐—บ๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€ ๐Ÿš€

โšก Mixture of Experts (MoE) architecture: 389 B parameters in total, but only 52B are activated for any input

๐Ÿงช Trained on 7T tokens, including 1.5T tokens of synthetic data

๐Ÿ—๏ธ Architecture : Novel "recycle routing" prevents token dropping when experts are overrloaded

๐Ÿ“Š Great benchmark results: Surpasses Llama-3-405B-Instruct in most benchmarks although it has 8x fewer active parameters
โ€ฃ Impressive perf on MATH: 77.4

๐Ÿ‹ย Large context length: up to 256K tokens

๐Ÿ”’ License:
โ€ฃ Commercial use allowed, except if your products have >100M monthly active users
โ€ฃ No access in the EU

๐Ÿค—ย Model weights available on HF!

Read the full paper here ๐Ÿ‘‰ย  Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (2411.02265)
posted an update 23 days ago
view post
Post
1541
๐Ÿง ย  ๐—–๐—Ÿ๐—˜๐—”๐—ฅ: ๐—ณ๐—ถ๐—ฟ๐˜€๐˜ ๐—บ๐˜‚๐—น๐˜๐—ถ๐—บ๐—ผ๐—ฑ๐—ฎ๐—น ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐˜๐—ผ ๐—บ๐—ฎ๐—ธ๐—ฒ ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—ณ๐—ผ๐—ฟ๐—ด๐—ฒ๐˜ ๐˜„๐—ต๐—ฎ๐˜ ๐˜„๐—ฒ ๐˜„๐—ฎ๐—ป๐˜ ๐˜๐—ต๐—ฒ๐—บ ๐˜๐—ผ ๐—ณ๐—ผ๐—ฟ๐—ด๐—ฒ๐˜

With privacy concerns rising, we sometimes need our models to "forget" specific information - like a person's data - while keeping everything else intact. Researchers just released CLEAR, the first benchmark to test how well this works with both text and images.

โŒย Bad news: Current methods either fail to truly forget or end up forgetting way too much. It's like trying to remove a single ingredient from a baked cake!

โœจย But there's hope: Adding simple mathematical constraints (L1 regularization) during the forgetting process significantly improves results.

๐ŸŽฏย Key insights:

โœ…ย The benchmark tests forgetting on 200 fictional personas
โ€ฃ 3,770 visual Q&A pairs
โ€ฃ 4,000 textual Q&A pairs
โ€ฃ Additional real-world tests

๐Ÿ›‘ย Most current forgetting methods don't work well with both text and images
โ€ฃ They either remember what they should forget
โ€ฃ Or they forget too much unrelated information

โœจย Simple mathematical constraints work surprisingly well
โ€ฃ L1 regularization prevents excessive forgetting
โ€ฃ Works especially well with the LLMU method

๐Ÿ‘‰ย Read the full paper here: CLEAR: Character Unlearning in Textual and Visual Modalities (2410.18057)
posted an update 24 days ago
view post
Post
2348
> Oasis: First Real-Time Video Game Without a Game Engine! ๐ŸŽฎ

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

โšก๏ธ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of device I had to play LoL back in the day, thus my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

โš™๏ธ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
โšก๏ธ Generates 20 FPS, vs 0.2 FPS for other DIT-based video models
โ€ฃ The specialized hardware Sohu developed by Etched allows to handle 10x more player than H100

๐ŸŽฎ Features real game mechanics
โ€ฃ Movement, jumping, item management
โ€ฃ Physics and lighting
โ€ฃ Procedurally generated worlds

โš ๏ธ Current limitations
โ€ฃ Blurry graphics at a distance
โ€ฃ Objects sometimes change appearance
โ€ฃ Memory issues in long sessions

Try it yourself, the playable demo is impressive! ๐Ÿ‘‰ https://oasis.decart.ai/welcome
Code ๐Ÿ‘‰ https://github.com/etched-ai/open-oasis
Read it in full ๐Ÿ‘‰ https://oasis-model.github.io/
posted an update 28 days ago
view post
Post
1824
I'm very proud to have supported @CGIAR and @Digigreen in making http://Farmer.chat, an app that supports 20k smallholder farmers on a daily basis ๐ŸŒพ

There are ~500 million smallholder farmers globally, playing a critical role in global food security. Having access to accurate information is essential for them.

๐Ÿ’ฌ An โ€œagricultural extension serviceโ€ offers technical advice on agriculture, and also supplies farmers with the necessary inputs and services to support their agricultural production.

But agriculture extension agents are not in large enough numbers to cope with all the requests, especially in countries like Kenya, India, Ethiopia, and Nigeria.

๐Ÿš€ So the team set out to build an app called http://Farmer.Chat, to provide an agricultural extension service, by building on the immense knowledge accumulated by CGIAR.

โœจ The app is technically impressive: behind the Whatsapp-type UX, an agent interprets the user's intent, and identifies which tool to call to best answer their request: weather API, RAG on a CGIAR-provided knowledge base, market data, etc. The RAG on the knowledge base is in itself a work of art.

๐ŸŽฏ A key part of building such a complex system is to be able to evaluate it properly. During our bi-weekly sessions with the team, I could support them in implementing the method called "LLM-as-a-judge" to tackle this problem.

It worked really well : thanks to the amazing work of the team, the app now successfully answered over 300 thousand requests, in 6 different languages, and it keeps growing!

โžก๏ธ @Vinsingh , @rajgreen and I just wrote a blog post to describe how the app works, especially the LLM-as-a-judge system!

Read it here ๐Ÿ‘‰ https://huggingface.co/blog/digital-green-llm-judge