iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper • 2405.15223 • Published May 24 • 11
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published May 24 • 52
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Paper • 2405.18669 • Published May 29 • 11
MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published 30 days ago • 19
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM Paper • 2406.02884 • Published 24 days ago • 13
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published 17 days ago • 38
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published 18 days ago • 30
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published 17 days ago • 23
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation Paper • 2406.07686 • Published 18 days ago • 13
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published 16 days ago • 17
Explore the Limits of Omni-modal Pretraining at Scale Paper • 2406.09412 • Published 16 days ago • 10
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities Paper • 2406.09406 • Published 16 days ago • 12
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published 15 days ago • 53
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published 17 days ago • 28
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published 12 days ago • 35
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs Paper • 2406.14544 • Published 9 days ago • 33
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents Paper • 2406.13923 • Published 9 days ago • 20
Improving Visual Commonsense in Language Models via Multiple Image Generation Paper • 2406.13621 • Published 10 days ago • 13
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report Paper • 2406.11403 • Published 12 days ago • 4
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters Paper • 2406.16758 • Published 5 days ago • 16
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published 5 days ago • 45
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models Paper • 2406.15704 • Published 7 days ago • 5