MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published 16 days ago • 44
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published 21 days ago • 26
Data Engineering for Scaling Language Models to 128K Context Paper • 2402.10171 • Published Feb 15 • 23
Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare Paper • 2405.19298 • Published May 29
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published Nov 20 • 17