- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 23
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning
  Paper • 2404.12803 • Published • 28
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  Paper • 2404.13013 • Published • 27
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
  Paper • 2404.06512 • Published • 29
Collections including paper arxiv:2405.02246
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 10
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 6
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 33
- SILC: Improving Vision Language Pretraining with Self-Distillation
  Paper • 2310.13355 • Published • 5

- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 123
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 47
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 9
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 63

- How Far Are We from Intelligent Visual Deductive Reasoning?
  Paper • 2403.04732 • Published • 18
- MoAI: Mixture of All Intelligence for Large Language and Vision Models
  Paper • 2403.07508 • Published • 73
- DragAnything: Motion Control for Anything using Entity Representation
  Paper • 2403.07420 • Published • 12
- Learning and Leveraging World Models in Visual Representation Learning
  Paper • 2403.00504 • Published • 26

- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
  Paper • 2401.13313 • Published • 4
- BAAI/Bunny-v1_0-4B
  Text Generation • Updated • 103 • 7
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 91
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever
  Paper • 2405.20204 • Published • 27

- Neural Network Diffusion
  Paper • 2402.13144 • Published • 94
- Genie: Generative Interactive Environments
  Paper • 2402.15391 • Published • 68
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  Paper • 2402.17177 • Published • 88
- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
  Paper • 2403.00522 • Published • 40

- OLMo: Accelerating the Science of Language Models
  Paper • 2402.00838 • Published • 75
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 26
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 63
- Poro 34B and the Blessing of Multilinguality
  Paper • 2404.01856 • Published • 12