Hosting our own inference was not enough: the Hub now has 4 new inference providers: fal, Replicate, SambaNova Systems, and Together AI.
Check model cards on the Hub: you can now use inference from various providers in one click (cf. the video demo).
Their inference can also be used through our Inference API clients. There, you can use either your own provider key or your HF token; in the latter case, billing is handled directly on your HF account, as a way to centralize all expenses.
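For example, with the JavaScript client. This is a minimal sketch: the model id and provider string below are just illustrations, and the `provider` option assumes a recent @huggingface/inference release with provider support.

```ts
import { HfInference } from "@huggingface/inference";

// Use your HF token here to have billing centralized on your HF account,
// or pass a provider-specific key instead.
const client = new HfInference("hf_xxxxxxxxxxxxxxxx");

const completion = await client.chatCompletion({
  model: "deepseek-ai/DeepSeek-R1", // example model; pick any model the provider serves
  provider: "together",             // or "fal-ai", "replicate", "sambanova"
  messages: [{ role: "user", content: "Summarize inference providers in one sentence." }],
  max_tokens: 200,
});

console.log(completion.choices[0].message.content);
```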
Also, PRO users get $2 of inference credits per month!
Multimodal
- We released SmolVLM, our tiniest VLMs yet, coming in 256M and 500M sizes, along with its ColSmol retrieval models for multimodal RAG
- UI-TARS are new models by ByteDance to unlock agentic GUI control, in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLLaMA3, new video LMs that come in 2B and 7B
- MiniMaxAI released MiniMax-VL-01, whose decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging multimodal benchmark
LLMs
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, plus six distilled dense models, on par with o1 and MIT-licensed!
- Qwen2.5-Math-PRM: new math process reward models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models along with their datasets (SFT and reward ones too!)
Audio
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris, similar to Flux
- Tencent released Hunyuan3D-2, a new model for 3D asset generation from images
Google releases Gemini 2.0, starting with a Flash model that steamrolls GPT-4o and Claude-3.6 Sonnet! And they start a huge effort on agentic capabilities.
The performance improvements are crazy for such a fast model:
- Gemini 2.0 Flash outperforms the previous 1.5 Pro model at twice the speed
- Now supports both input AND output of images, video, audio and text
- Can natively use tools like Google Search and execute code
If the price is on par with the previous Flash iteration ($0.30/M tokens, compared with GPT-4o's $1.25), the competition will have a big problem with this roughly 4x cheaper model that gets better benchmarks.
What about the agentic capabilities?
- Project Astra: a universal AI assistant that can use Google Search, Lens and Maps
- Project Mariner: a Chrome extension that can complete complex web tasks (83.5% success rate on the WebVoyager benchmark, which is really impressive!)
- Jules: an AI coding agent that integrates with GitHub workflows
I'll be eagerly awaiting further news from Google!
Multimodal
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes (3B, 10B and 28B) and pre-trained and captioning variants
> OpenGVLab released InternVL2, seven new vision LMs in different sizes, with a SOTA checkpoint under MIT license
> The Qwen team at Alibaba released the base models of Qwen2-VL in 2B, 7B and 72B checkpoints
LLMs
> Meta released a new iteration of Llama 70B, Llama-3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with an Apache 2.0 license
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU covering 42 languages, with an Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update: nearly 8TB of pretraining data in many languages!
Image/Video Generation
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux
Audio
> Indic-Parler-TTS is a new text-to-speech model made by the community
Introducing TTS WebGPU: the first ever text-to-speech web app built with WebGPU acceleration! High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. Try it out yourself!
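Under the hood this is the usual Transformers.js flow. Below is a minimal, hypothetical sketch of in-browser TTS with the library; it uses the generic text-to-speech pipeline with a SpeechT5 checkpoint as a stand-in rather than the app's actual OuteTTS setup, and WebGPU support varies by model and browser.

```ts
import { pipeline } from "@huggingface/transformers";

// Sketch only: SpeechT5 stand-in, not the OuteTTS pipeline the app actually uses.
const synthesizer = await pipeline("text-to-speech", "Xenova/speecht5_tts", {
  device: "webgpu", // drop this option if WebGPU is unavailable in your browser
});

// SpeechT5 needs speaker embeddings; this file comes from the Transformers.js docs.
const speaker_embeddings =
  "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin";

const { audio, sampling_rate } = await synthesizer(
  "High-quality speech, generated entirely in your browser.",
  { speaker_embeddings },
);
// `audio` is a Float32Array you can play back with the Web Audio API at `sampling_rate`.
```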
A team from NUS and Microsoft just released an agent that can act on any UI (desktop, Android, web) without needing additional text information. It works extremely well: they applied their method to a tiny Qwen2-VL-2B and managed to beat methods that use much more powerful vision models (like GPT-4V), without relying on any additional info (e.g. the DOM of a webpage) like previous methods did!
They started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while setting aside the rich UI structure that users actually rely on when navigating these interfaces.
They put several good ideas to work:
Simplify screenshots to the max: they heavily prune the visual content of UI screenshots by removing cloned image patches (for example, any vast area of the same color is reduced to a small patch, while positional embeddings are maintained), then group patches from the same GUI elements together to simplify even further (a toy sketch of this pruning idea follows after this post).
Build a truly generalist dataset: to train a general UI agent, you need trajectories from every possible UI, expressed in a common language. The authors merge datasets like OmniAct for desktop, Mind2Web for websites and AMEX for Android trajectories to create a high-quality and diverse dataset.
Nice results ensued: they fine-tune a tiny Qwen2-VL-2B with their method, and it reaches SOTA on several tasks (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMs like GPT-4V!
And performance could certainly jump with a slightly bigger vision model. Let's hope the community builds this soon!
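To make the patch-pruning idea concrete, here is a toy sketch. It is not the paper's implementation; it just illustrates collapsing near-identical neighbouring patches of a grayscale "screenshot" while remembering their original positions, and every name and the merging rule are made up for this example.

```ts
// Toy illustration of redundant-patch pruning (NOT the paper's code):
// split a grayscale image into fixed-size patches, keep one representative per
// run of near-identical neighbouring patches, and record original positions so
// positional embeddings could still be applied downstream.

type Patch = { row: number; col: number; pixels: number[] };

function toPatches(image: number[][], patchSize: number): Patch[] {
  const patches: Patch[] = [];
  for (let r = 0; r + patchSize <= image.length; r += patchSize) {
    for (let c = 0; c + patchSize <= image[0].length; c += patchSize) {
      const pixels: number[] = [];
      for (let i = 0; i < patchSize; i++)
        for (let j = 0; j < patchSize; j++) pixels.push(image[r + i][c + j]);
      patches.push({ row: r / patchSize, col: c / patchSize, pixels });
    }
  }
  return patches;
}

// Two patches are "clones" if their pixel values are (almost) identical,
// e.g. large areas of uniform background colour.
function isClone(a: Patch, b: Patch, tol = 1e-3): boolean {
  return a.pixels.every((v, i) => Math.abs(v - b.pixels[i]) < tol);
}

// Keep the first patch of each horizontal run of clones; remember which
// original grid positions each kept patch stands for.
function pruneClones(patches: Patch[]): { kept: Patch; covers: [number, number][] }[] {
  const result: { kept: Patch; covers: [number, number][] }[] = [];
  for (const p of patches) {
    const last = result[result.length - 1];
    if (last && last.kept.row === p.row && isClone(last.kept, p)) {
      last.covers.push([p.row, p.col]); // merged, but its position is not lost
    } else {
      result.push({ kept: p, covers: [[p.row, p.col]] });
    }
  }
  return result;
}

// Example: a mostly-white 4x8 "screenshot" collapses from 8 patches to 3.
const image = Array.from({ length: 4 }, (_, r) =>
  Array.from({ length: 8 }, (_, c) => (r < 2 && c >= 4 ? 0 : 255)),
);
console.log(pruneClones(toPatches(image, 2)).length); // 3
```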
We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser with WebGPU! Let's take a look:
- Janus from DeepSeek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
- Qwen2-VL from Qwen for dynamic-resolution image understanding
- JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
- LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
- ViTPose for pose estimation
- MGP-STR for optical character recognition (OCR)
- PatchTST & PatchTSMixer for time series forecasting
That's right, everything running 100% locally in your browser (no data sent to a server)! Huge for privacy!
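Each model above has its own loading snippet in the release notes; the general pattern for local, in-browser inference with the WebGPU backend looks roughly like this sketch (the task, model id and image URL here are illustrative placeholders, not one of the v3.1 additions).

```ts
import { pipeline } from "@huggingface/transformers";

// Generic local-inference pattern: pick a task + ONNX checkpoint and request WebGPU.
const captioner = await pipeline("image-to-text", "Xenova/vit-gpt2-image-captioning", {
  device: "webgpu", // use "wasm" instead if WebGPU is unavailable
});

// Placeholder image URL; any image URL or File/Blob works.
const [{ generated_text }] = await captioner("https://example.com/photo.jpg");
console.log(generated_text); // nothing ever left the browser
```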
How does it work?
- You give a URL
- The AI assistant crawls the website content and embeds it
- You add it to your frontend in one line of code
- Visitors on your website can ask the assistant questions
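That crawl-embed-answer flow is straightforward to sketch. The snippet below is a hypothetical, stripped-down version using @huggingface/inference; the embedding and chat model ids, the naive HTML stripping and the single-chunk retrieval are placeholders, not the assistant's actual implementation.

```ts
import { HfInference } from "@huggingface/inference";

const hf = new HfInference(process.env.HF_TOKEN);

// 1. "Crawl": fetch the page and crudely strip tags (placeholder for a real crawler).
async function fetchText(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  return html.replace(/<script[\s\S]*?<\/script>/gi, "").replace(/<[^>]+>/g, " ");
}

// 2. Embed fixed-size chunks of the page content.
async function embedChunks(text: string, size = 1000): Promise<{ chunk: string; vec: number[] }[]> {
  const chunks = text.match(new RegExp(`[\\s\\S]{1,${size}}`, "g")) ?? [];
  return Promise.all(
    chunks.map(async (chunk) => ({
      chunk,
      vec: (await hf.featureExtraction({
        model: "sentence-transformers/all-MiniLM-L6-v2", // placeholder embedding model
        inputs: chunk,
      })) as number[],
    })),
  );
}

// Cosine similarity between two embedding vectors.
const cosine = (a: number[], b: number[]) =>
  a.reduce((s, v, i) => s + v * b[i], 0) / (Math.hypot(...a) * Math.hypot(...b));

// 3. Answer a visitor's question from the most relevant chunk.
async function ask(index: { chunk: string; vec: number[] }[], question: string) {
  const qVec = (await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: question,
  })) as number[];
  const best = index.reduce((a, b) => (cosine(a.vec, qVec) > cosine(b.vec, qVec) ? a : b));
  const reply = await hf.chatCompletion({
    model: "meta-llama/Llama-3.1-8B-Instruct", // placeholder chat model
    messages: [
      { role: "system", content: `Answer using this website excerpt:\n${best.chunk}` },
      { role: "user", content: question },
    ],
    max_tokens: 300,
  });
  return reply.choices[0].message.content;
}

// Usage: const index = await embedChunks(await fetchText("https://example.com"));
//        console.log(await ask(index, "What does this site offer?"));
```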