InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published 3 days ago • 209
Post
This paper has been blowing up. They train an open-source multimodal LLM (InternVL3) that can compete with GPT-4o and Claude 3.5 Sonnet by:
> training text and vision in a single stage
> a novel V2PE positional encoding
> SFT & mixed preference optimization
> test-time scaling
Paper: InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (2504.10479)
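The V2PE idea mentioned above (variable position encoding for visual tokens) can be sketched roughly as follows: text tokens advance the position index by 1, while visual tokens advance it by a smaller fraction, so long interleaved image-text sequences stay within the model's position range. This is a minimal illustrative sketch; the `delta` value, token labels, and function name are assumptions, not taken from the paper.

```python
# Minimal sketch of variable position increments (V2PE-style).
# Assumption: text tokens step the position by 1.0, visual tokens by
# a smaller delta, compressing the positional footprint of images.

def v2pe_position_ids(token_types, delta=0.25):
    """token_types: list of 'text' or 'image'; returns float position ids."""
    positions = []
    pos = 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta
    return positions

# Two text tokens, four image-patch tokens, one text token:
print(v2pe_position_ids(["text", "text", "image", "image", "image", "image", "text"]))
# -> [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0]
```

With `delta=0.25`, four image patches consume only one position step, which is the intuition behind fitting long multimodal contexts into a fixed position budget.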
Qwen2.5-VL Collection Vision-language model series based on Qwen2.5 • 11 items • Updated 17 days ago • 444
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme Paper • 2504.02587 • Published 14 days ago • 30
Post
MAYE 🎈 a from-scratch RL framework for Vision Language Models, released by GAIR, an active research group from the Chinese community.
✨ Minimal & transparent pipeline built with standard tools
✨ Standardized evaluation to track training & reflection
✨ Open code & dataset
Code: https://github.com/GAIR-NLP/MAYE?tab=readme-ov-file
Dataset: ManTle/MAYE
Paper: Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme (2504.02587)
Post
Qwen 3 can launch very soon. 👀
https://github.com/ggml-org/llama.cpp/pull/12828
Post
🚨 Hot Take: GPT-4o might NOT be a purely autoregressive model! 🚨
There's a high chance it has a diffusion head. 🤯 If true, this could be a game-changer for AI architecture. What do you think? 🤔👇
Code: https://github.com/PicoTrex/GPT-ImgEval
Dataset: Yejy53/GPT-ImgEval
Paper: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation (2504.02782)