BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 23
TextSquare: Scaling up Text-Centric Visual Instruction Tuning Paper • 2404.12803 • Published Apr 19 • 28
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Paper • 2404.13013 • Published Apr 19 • 27
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9 • 29
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Paper • 2404.05719 • Published Apr 8 • 57
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Paper • 2404.05726 • Published Apr 8 • 18
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 22
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Paper • 2311.16502 • Published Nov 27, 2023 • 33
Kosmos-2: Grounding Multimodal Large Language Models to the World Paper • 2306.14824 • Published Jun 26, 2023 • 34
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25 • 49
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Paper • 2404.16375 • Published Apr 25 • 16
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25 • 7
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 33
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published 29 days ago • 15
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published 23 days ago • 69
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published 16 days ago • 47
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published 16 days ago • 17
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published 16 days ago • 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Paper • 2406.08707 • Published 16 days ago • 14
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published 12 days ago • 60
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published 12 days ago • 20
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published 15 days ago • 53