WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines Paper • 2410.12705 • Published Oct 16 • 29 • 3
Guiding a Diffusion Model with a Bad Version of Itself Paper • 2406.02507 • Published Jun 4 • 15 • 1
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots Paper • 2406.02523 • Published Jun 4 • 10 • 1
V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation Paper • 2406.02511 • Published Jun 4 • 9 • 2
I4VGen: Image as Stepping Stone for Text-to-Video Generation Paper • 2406.02230 • Published Jun 4 • 16 • 3
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models Paper • 2406.02430 • Published Jun 4 • 30 • 2
PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs Paper • 2406.02886 • Published Jun 5 • 8 • 1
Item-Language Model for Conversational Recommendation Paper • 2406.02844 • Published Jun 5 • 8 • 1
Searching Priors Makes Text-to-Video Synthesis Better Paper • 2406.03215 • Published Jun 5 • 11 • 2
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM Paper • 2406.02884 • Published Jun 5 • 15 • 2
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning Paper • 2406.03344 • Published Jun 5 • 18 • 1
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration Paper • 2406.01014 • Published Jun 3 • 31 • 2
Block Transformer: Global-to-Local Language Modeling for Fast Inference Paper • 2406.02657 • Published Jun 4 • 37 • 1