VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published 3 days ago • 64
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024 • 113
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 89
MoH: Multi-Head Attention as Mixture-of-Head Attention Paper • 2410.11842 • Published Oct 15, 2024 • 21 • 2
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model Paper • 2303.09867 • Published Mar 17, 2023
Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation Paper • 2303.13399 • Published Mar 23, 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning Paper • 2303.14369 • Published Mar 25, 2023