QvQ-72B-Preview 🎄 an open-weight model for visual reasoning, just released by the Alibaba Qwen team
👉 Qwen/qvq-676448c820912236342b9888
✨ Combines visual understanding & language reasoning
✨ Scores 70.3 on MMMU
✨ Outperforms Qwen2-VL-72B-Instruct in complex problem-solving
Megrez-3B-Omni 🔥 an on-device multimodal LLM by Infinigence AI, another startup emerging from the Tsinghua University ecosystem.
Model: Infinigence/Megrez-3B-Omni
Demo: Infinigence/Megrez-3B-Omni
✨ Supports analysis of image, text, and audio modalities
✨ Leads in bilingual speech (English & Chinese) input, multi-turn conversations, and voice-based queries
✨ Outperforms comparable models in scene understanding and OCR across major benchmarks
LLaMA-O1-PRM and LLaMA-O1-Reinforcement will be released this weekend. We have implemented a novel reinforcement finetuning (RFT) pipeline that teaches models reasoning and reward labeling without human annotation.
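The post doesn't detail the pipeline, but the general shape of a reinforcement finetuning round can be sketched as below. This is a generic illustration, not the authors' actual method; `sample_traces` and `reward` are hypothetical stand-ins for the model's generator and a learned reward model.

```python
# Generic sketch of one reinforcement finetuning (RFT) round, NOT the
# LLaMA-O1 pipeline: sample candidate reasoning traces, score them with
# a (stand-in) reward model, and keep only high-reward traces for the
# next supervised finetuning pass. No human annotation is involved.

def sample_traces(prompt: str, n: int = 4) -> list[str]:
    # Stand-in for model generation of n candidate reasoning traces.
    return [f"{prompt} :: candidate reasoning trace {i}" for i in range(n)]

def reward(trace: str) -> float:
    # Stand-in for a learned reward / process reward model score in [0, 1].
    return (len(trace) % 5) / 4.0

def rft_round(prompts: list[str], threshold: float = 0.5) -> list[str]:
    """Collect traces whose reward clears the threshold."""
    kept = []
    for prompt in prompts:
        for trace in sample_traces(prompt):
            if reward(trace) >= threshold:
                kept.append(trace)
    return kept  # finetune the model on these traces, then repeat

selected = rft_round(["What is 2 + 2?"])
```

The key property the post claims — no human annotation — comes from the reward model doing the labeling: iterating this loop lets the model bootstrap its own training data.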
After some heated discussion 🔥, we want to clarify our intent regarding storage limits on the Hub.
TL;DR:
- public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
Last week was crazy in open-source AI, with important model and dataset releases every day.
Here are the most important ones I've pinned:
🌎 Cohere released Global-MMLU, a multilingual version of MMLU, to evaluate AI models' world knowledge in many languages!
🦙 Meta released Llama-3.3-70B-Instruct, a 70B model that's on par with Llama-3.1-405B-Instruct, GPT-4o and Claude. Probably my new go-to for agentic workflows.
🔉 FishAudio released fish-speech-1.5, a multilingual text-to-speech model
🎨 Microsoft Research released TRELLIS, an extremely impressive image-to-3D model, which you can try here: JeffreyXiang/TRELLIS
📚 Yesterday, Hugging Face released FineWeb 2, a new version that extends the previous FineWeb to over 1000 languages, with extended coverage of Russian, Mandarin, German, Japanese, Spanish, and French — a huge, high-quality dataset of over 3 trillion words! HuggingFaceFW/fineweb-2
Now let's go build and make this week as productive as the last one!
Open Preference Dataset for Text-to-Image Generation by the 🤗 Community
Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. It contains 10K text-to-image preference pairs across common image generation categories, covering different model families and varying prompt complexities.
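A text-to-image preference pair couples one prompt with a chosen and a rejected generation. As a minimal illustration — the field and model names below are hypothetical, not the dataset's actual schema:

```python
# Illustrative structure of a text-to-image preference pair.
# Field names and values are hypothetical, not the dataset's schema.
pair = {
    "prompt": "a watercolor fox in a snowy forest",
    "chosen": {"model": "model-A", "image": "fox_a.png"},
    "rejected": {"model": "model-B", "image": "fox_b.png"},
}

def preferred_model(pair: dict) -> str:
    """Return the model whose generation was preferred for this prompt."""
    return pair["chosen"]["model"]

print(preferred_model(pair))  # prints "model-A"
```

Pairs in this shape are what preference-optimization methods (e.g. DPO-style finetuning of image models) consume.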
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!
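Since the data and code are public, a single language subset can plausibly be streamed with the `datasets` library. A hedged sketch, assuming config names follow the language_Script convention (e.g. `rus_Cyrl`) — check the dataset card for the exact names:

```python
# Sketch of streaming one FineWeb-2 language subset. The config-name
# convention (language_Script, e.g. "rus_Cyrl") is an assumption based
# on the dataset card. STREAM stays False here so nothing is downloaded.
STREAM = False  # flip to True to actually stream (needs `datasets` + network)

def fineweb2_config(lang: str, script: str) -> str:
    """Build a config name like 'rus_Cyrl' from an ISO code and a script."""
    return f"{lang}_{script}"

if STREAM:
    from datasets import load_dataset  # pip install datasets

    fw2 = load_dataset(
        "HuggingFaceFW/fineweb-2",
        name=fineweb2_config("rus", "Cyrl"),
        split="train",
        streaming=True,  # iterate shards lazily instead of fetching all 8TB
    )
    for doc in fw2.take(3):
        print(doc["text"][:80])
```

`streaming=True` matters here: it yields documents shard by shard, so you can inspect a multi-terabyte dataset without downloading it.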
Audio models:
✨ Fish Speech 1.5, text-to-speech in 13 languages, trained on 1M+ hours of audio by FishAudio fishaudio/fish-speech-1.5
✨ ClearVoice, an advanced voice processing framework by Alibaba Tongyi SpeechAI https://huggingface.co/alibabasglab
HunyuanVideo 📹 the new open video generation model by Tencent!
👉 tencent/HunyuanVideo
zh-ai-community/video-models-666afd86cfa4e4dd1473b64c
✨ 13B parameters: probably the largest open video model to date
✨ Unified architecture for image & video generation
✨ Powered by advanced features: MLLM Text Encoder, 3D VAE, and Prompt Rewrite
✨ Delivers stunning visuals, diverse motion, and unparalleled stability
🔓 Fully open with code & weights
Zhipu AI, the Chinese generative AI startup behind CogVideo, just launched their first productized AI Agent - AutoGLM 🔥 👉 https://agent.aminer.cn
With simple text or voice commands, it:
✨ Simulates phone operations effortlessly
✨ Autonomously handles 50+ step tasks
✨ Seamlessly operates across apps
Powered by Zhipu's "Decoupled Interface" and "Self-Evolving Learning Framework" to achieve major performance gains in Phone Use and Web Browser Use!
Meanwhile, GLM4-Edge is now on the Hugging Face Hub 🚀
👉 THUDM/glm-edge-6743283c5809de4a7b9e0b8b
Packed with advanced dialogue + multimodal models:
📱 1.5B / 2B models: built for mobile & in-car systems
💻 4B / 5B models: optimized for PCs