zerozeyi's Collections: VisionLM

A collection of vision-language and multimodal model papers, each listed with its arXiv ID.
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (arXiv:2402.04252)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv:2402.03749)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (arXiv:2402.05008)
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (arXiv:2402.05930)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (arXiv:2402.05935)
- ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling (arXiv:2402.06118)
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (arXiv:2402.07456)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs (arXiv:2402.07872)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models (arXiv:2402.07865)
- World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter (arXiv:2402.10896)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (arXiv:2402.10986)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (arXiv:2402.12226)
- CoLLaVO: Crayon Large Language and Vision mOdel (arXiv:2402.11248)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning (arXiv:2402.11690)
- VideoPrism: A Foundational Visual Encoder for Video Understanding (arXiv:2402.13217)
- Video ReCap: Recursive Captioning of Hour-Long Videos (arXiv:2402.13250)
- A Touch, Vision, and Language Dataset for Multimodal Alignment (arXiv:2402.13232)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (arXiv:2402.13220)
- BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models (arXiv:2402.13577)
- PALO: A Polyglot Large Multimodal Model for 5B People (arXiv:2402.14818)
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (arXiv:2402.17177)
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (arXiv:2402.19479)
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies (arXiv:2403.01422)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (arXiv:2403.01487)
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters (arXiv:2403.02677)
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use (arXiv:2403.02626)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets (arXiv:2403.03194)
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (arXiv:2403.03003)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv:2403.09611)
- MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
- Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (arXiv:2403.07750)
- DragAnything: Motion Control for Anything using Entity Representation (arXiv:2403.07420)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (arXiv:2403.06764)
- VideoMamba: State Space Model for Efficient Video Understanding (arXiv:2403.06977)
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding (arXiv:2403.05525)
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models (arXiv:2403.05438)
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (arXiv:2403.11481)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (arXiv:2403.12895)
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs (arXiv:2403.12596)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)
- Can large language models explore in-context? (arXiv:2403.15371)
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series (arXiv:2403.15360)
- VidLA: Video-Language Alignment at Scale (arXiv:2403.14870)
- ViTAR: Vision Transformer with Any Resolution (arXiv:2403.18361)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
- sDPO: Don't Use Your Data All at Once (arXiv:2403.19270)
- TextCraftor: Your Text Encoder Can be Image Quality Controller (arXiv:2403.18978)
- Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (arXiv:2403.20331)
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models (arXiv:2404.01197)
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward (arXiv:2404.01258)
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (arXiv:2404.03653)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (arXiv:2404.05726)
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation (arXiv:2404.05674)
- Koala: Key frame-conditioned long video-LLM (arXiv:2404.04346)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
- Adapting LLaMA Decoder to Vision Transformer (arXiv:2404.06773)
- BRAVE: Broadening the visual encoding of vision-language models (arXiv:2404.07204)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation (arXiv:2404.07448)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing (arXiv:2404.09990)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (arXiv:2404.09204)
- On Speculative Decoding for Multimodal Large Language Models (arXiv:2404.08856)
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390)
- MultiBooth: Towards Generating All Your Concepts in an Image from Text (arXiv:2404.14239)
- A Multimodal Automated Interpretability Agent (arXiv:2404.14394)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning (arXiv:2404.12803)
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (arXiv:2404.15653)
- Editable Image Elements for Controllable Synthesis (arXiv:2404.16029)
- MoDE: CLIP Data Experts via Clustering (arXiv:2404.16030)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension (arXiv:2404.16790)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821)
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs (arXiv:2404.16375)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
- HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections (arXiv:2404.16845)
- BlenderAlchemy: Editing 3D Graphics with Vision-Language Models (arXiv:2404.17672)
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations (arXiv:2404.17521)
- Automatic Creative Selection with Cross-Modal Matching (arXiv:2405.00029)
- What matters when building vision-language models? (arXiv:2405.02246)
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots (arXiv:2405.07990)
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding (arXiv:2405.08344)
- Understanding the performance gap between online and offline alignment algorithms (arXiv:2405.08448)
- SpeechVerse: A Large-scale Generalizable Audio Language Model (arXiv:2405.08295)
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models (arXiv:2405.08317)
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model (arXiv:2405.09215)
- LoRA Learns Less and Forgets Less (arXiv:2405.09673)
- Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
- Toon3D: Seeing Cartoons from a New Perspective (arXiv:2405.10320)
- Octo: An Open-Source Generalist Robot Policy (arXiv:2405.12213)
- Imp: Highly Capable Large Multimodal Models for Mobile Devices (arXiv:2405.12107)
- Your Transformer is Secretly Linear (arXiv:2405.12250)
- Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
- CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers (arXiv:2405.13195)
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574)
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition (arXiv:2405.15216)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- Matryoshka Multimodal Models (arXiv:2405.17430)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (arXiv:2405.17428)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)
- Dense Connector for MLLMs (arXiv:2405.13800)
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation (arXiv:2405.14598)
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (arXiv:2405.20204)
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities (arXiv:2405.18669)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos (arXiv:2405.20340)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (arXiv:2405.21075)
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (arXiv:2406.00888)
- Parrot: Multilingual Visual Instruction Tuning (arXiv:2406.02539)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM (arXiv:2406.02884)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments (arXiv:2406.04151)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (arXiv:2406.01014)
- Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040)
- An Image is Worth 32 Tokens for Reconstruction and Generation (arXiv:2406.07550)
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising (arXiv:2406.06911)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (arXiv:2406.07476)
- What If We Recaption Billions of Web Images with LLaMA-3? (arXiv:2406.08478)
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (arXiv:2406.08407)
- Needle In A Multimodal Haystack (arXiv:2406.07230)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models (arXiv:2406.11839)
- VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816)
- TroL: Traversal of Layers for Large Language and Vision Models (arXiv:2406.12246)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (arXiv:2406.12275)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning (arXiv:2406.12742)
- Adversarial Attacks on Multimodal Agents (arXiv:2406.12814)
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models (arXiv:2406.11230)
- Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models (arXiv:2406.12649)
- Understanding Hallucinations in Diffusion Models through Mode Interpolation (arXiv:2406.09358)
- CMC-Bench: Towards a New Paradigm of Visual Signal Compression (arXiv:2406.09356)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (arXiv:2406.09406)
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (arXiv:2406.09403)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding (arXiv:2406.09411)
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus (arXiv:2406.08707)
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts (arXiv:2406.09162)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (arXiv:2406.08418)
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (arXiv:2406.08451)
- arXiv:2406.04127
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing (arXiv:2406.06523)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (arXiv:2406.08487)
- VCR: Visual Caption Restoration (arXiv:2406.06462)
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv:2406.09246)
- DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552)
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion (arXiv:2406.04338)
- Hibou: A Family of Foundational Vision Transformers for Pathology (arXiv:2406.05074)
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects (arXiv:2406.10210)
- XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning (arXiv:2406.08973)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (arXiv:2406.11833)
- Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models (arXiv:2406.11831)
- From Pixels to Prose: A Large Dataset of Dense Image Captions (arXiv:2406.10328)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs (arXiv:2406.14544)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (arXiv:2406.11069)
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens (arXiv:2406.11271)
- arXiv:2406.11775
- Unifying Multimodal Retrieval via Document Screenshot Embedding (arXiv:2406.11251)
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515)
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models (arXiv:2406.14035)
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (arXiv:2406.14596)
- Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report (arXiv:2406.11403)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv:2406.16338)
- Long Context Transfer from Language to Vision (arXiv:2406.16852)
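Since every entry above carries its arXiv ID, the collection can be resolved programmatically. Below is a minimal sketch, assuming network access, that batch-queries the public arXiv export API (the `id_list` parameter of `export.arxiv.org/api/query`) to recover titles; the sample IDs are taken from the list above, and `fetch_titles` is an illustrative helper, not part of any library.

```python
# Minimal sketch: resolve arXiv IDs from this collection to their titles
# via the public arXiv export API (an Atom feed over HTTP).
# Assumes network access; sample IDs below come from the list above.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed

def fetch_titles(arxiv_ids):
    """Batch-query the arXiv API and yield (id, title) pairs."""
    url = "http://export.arxiv.org/api/query?id_list=" + ",".join(arxiv_ids)
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(ATOM + "entry"):
        # Entry IDs look like "http://arxiv.org/abs/2402.04252v1"; keep the tail.
        entry_id = entry.find(ATOM + "id").text.rsplit("/", 1)[-1]
        # Titles in the feed wrap across lines; collapse the whitespace.
        title = " ".join(entry.find(ATOM + "title").text.split())
        yield entry_id, title

if __name__ == "__main__":
    sample = ["2402.04252", "2403.09611", "2405.09818"]
    for entry_id, title in fetch_titles(sample):
        # Each paper also has a Hugging Face page at huggingface.co/papers/<id>.
        print(f"{entry_id}: {title}")
```

The same ID also addresses the paper's Hugging Face discussion page (https://huggingface.co/papers/2402.04252, for example), so either endpoint can serve as the lookup target.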