Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models Paper • 2412.05939 • Published 9 days ago • 11
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation Paper • 2412.04432 • Published 11 days ago • 12
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant Paper • 2410.13360 • Published Oct 17 • 8
Grounding Descriptions in Images informs Zero-Shot Visual Recognition Paper • 2412.04429 • Published 11 days ago
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published 15 days ago • 25
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Paper • 2411.18203 • Published 20 days ago • 29
Towards Interpreting Visual Information Processing in Vision-Language Models Paper • 2410.07149 • Published Oct 9 • 1
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2 • 21
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy Paper • 2411.15453 • Published 24 days ago
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Paper • 2411.14982 • Published 24 days ago • 14