URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics Paper • 2501.04686 • Published 2 days ago • 40
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Paper • 2501.03895 • Published 3 days ago • 38
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Paper • 2501.02955 • Published 4 days ago • 37
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution Paper • 2501.02976 • Published 4 days ago • 44
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM Paper • 2501.01904 • Published 7 days ago • 27
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published 7 days ago • 32
MLLM-as-a-Judge for Image Safety without Human Labeling Paper • 2501.00192 • Published 11 days ago • 23
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 9 days ago • 91
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published 26 days ago • 51
On the Compositional Generalization of Multimodal LLMs for Medical Imaging Paper • 2412.20070 • Published 13 days ago • 43
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Paper • 2412.18525 • Published 17 days ago • 65
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks Paper • 2412.18072 • Published 18 days ago • 16
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation Paper • 2412.18176 • Published 18 days ago • 15
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Paper • 2412.18609 • Published 17 days ago • 15
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper • 2412.18319 • Published 17 days ago • 35
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding Paper • 2412.17295 • Published 19 days ago • 9
Diving into Self-Evolving Training for Multimodal Reasoning Paper • 2412.17451 • Published 18 days ago • 42
FastVLM: Efficient Vision Encoding for Vision Language Models Paper • 2412.13303 • Published 24 days ago • 13
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception Paper • 2412.14233 • Published 23 days ago • 6