Submitted by zhoutianyi 57 CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing · 4 authors 10
Submitted by sinwang 36 World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning · 7 authors 6
Submitted by LucasFang 29 GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing · 12 authors 2
Submitted by agwmon 29 Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models · 5 authors 2
Submitted by Owen777 27 CoRe^2: Collect, Reflect and Refine to Generate Better and Faster · 7 authors 4
Submitted by Weiyun1025 25 VisualPRM: An Effective Process Reward Model for Multimodal Reasoning · 15 authors 3
Submitted by EthanTaylor 20 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models · 8 authors 2
Submitted by yeates 20 OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting · 4 authors 2
Submitted by wondervictor 18 GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding · 10 authors 2
Submitted by ChenyangLyu 16 New Trends for Modern Machine Translation with Large Reasoning Models · 6 authors 2
Submitted by wenhu 15 VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search · 7 authors 2
Submitted by yyf86 14 DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation · 9 authors 2
Submitted by akhaliq 12 Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond · 14 authors 2
Submitted by akhaliq 12 Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k · 32 authors 2
Submitted by VityaVitalich 11 Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark · 6 authors 2
Submitted by akhaliq 10 R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization · 12 authors 3
Submitted by ArthurDouillard 10 Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo · 8 authors 2
Submitted by BestWishYsh 9 CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance · 10 authors 2
Submitted by sayakpaul 9 SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation · 9 authors 4
Submitted by hp-l33 6 Autoregressive Image Generation with Randomized Parallel Decoding · 4 authors 2
Submitted by allisonandreyev 6 Quantization for OpenAI's Whisper Models: A Comparative Analysis · 1 author 2
Submitted by chenblin26 5 ConsisLoRA: Enhancing Content and Style Consistency for LoRA-based Style Transfer · 6 authors 2
Submitted by AhmadMustafa 5 On the Limitations of Vision-Language Models in Understanding Image Transforms · 3 authors 2
Submitted by gabrielchua 4 MinorBench: A hand-built benchmark for content-based risks for children · 3 authors 3
Submitted by hkchengrex 3 The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation · 2 authors 2
Submitted by jhao 3 TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention · 9 authors 2
Submitted by imranraad 3 "Silent Is Not Actually Silent": An Investigation of Toxicity on Bug Report Discussion · 2 authors 2
Submitted by xzhao 2 Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective · 2 authors 2
Submitted by Jason0214 1 A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 · 5 authors 2
Submitted by Nikolai10 1 PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling · 6 authors 2
Submitted by alandao 1 PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM · 4 authors 2