zerozeyi's Collections: VisionLM

A collection of vision-language and multimodal model papers, each listed with its arXiv ID.
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (arXiv:2402.04252)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv:2402.03749)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (arXiv:2402.05008)
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (arXiv:2402.05930)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (arXiv:2402.05935)
- ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling (arXiv:2402.06118)
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (arXiv:2402.07456)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs (arXiv:2402.07872)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models (arXiv:2402.07865)
- World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter (arXiv:2402.10896)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (arXiv:2402.10986)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (arXiv:2402.12226)
- CoLLaVO: Crayon Large Language and Vision mOdel (arXiv:2402.11248)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning (arXiv:2402.11690)
- VideoPrism: A Foundational Visual Encoder for Video Understanding (arXiv:2402.13217)
- Video ReCap: Recursive Captioning of Hour-Long Videos (arXiv:2402.13250)
- A Touch, Vision, and Language Dataset for Multimodal Alignment (arXiv:2402.13232)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (arXiv:2402.13220)
- BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models (arXiv:2402.13577)
- PALO: A Polyglot Large Multimodal Model for 5B People (arXiv:2402.14818)
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (arXiv:2402.17177)
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (arXiv:2402.19479)
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies (arXiv:2403.01422)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (arXiv:2403.01487)
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters (arXiv:2403.02677)
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use (arXiv:2403.02626)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets (arXiv:2403.03194)
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (arXiv:2403.03003)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv:2403.09611)
- MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
- Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (arXiv:2403.07750)
- DragAnything: Motion Control for Anything using Entity Representation (arXiv:2403.07420)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (arXiv:2403.06764)
- VideoMamba: State Space Model for Efficient Video Understanding (arXiv:2403.06977)
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding (arXiv:2403.05525)
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models (arXiv:2403.05438)
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (arXiv:2403.11481)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (arXiv:2403.12895)
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs (arXiv:2403.12596)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)
- Can large language models explore in-context? (arXiv:2403.15371)
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series (arXiv:2403.15360)
- VidLA: Video-Language Alignment at Scale (arXiv:2403.14870)
- ViTAR: Vision Transformer with Any Resolution (arXiv:2403.18361)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
- sDPO: Don't Use Your Data All at Once (arXiv:2403.19270)
- TextCraftor: Your Text Encoder Can be Image Quality Controller (arXiv:2403.18978)
- Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (arXiv:2403.20331)
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models (arXiv:2404.01197)
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward (arXiv:2404.01258)
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (arXiv:2404.03653)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (arXiv:2404.05726)
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation (arXiv:2404.05674)
- Koala: Key frame-conditioned long video-LLM (arXiv:2404.04346)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
- Adapting LLaMA Decoder to Vision Transformer (arXiv:2404.06773)
- BRAVE: Broadening the visual encoding of vision-language models (arXiv:2404.07204)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation (arXiv:2404.07448)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing (arXiv:2404.09990)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (arXiv:2404.09204)
- On Speculative Decoding for Multimodal Large Language Models (arXiv:2404.08856)
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390)
- MultiBooth: Towards Generating All Your Concepts in an Image from Text (arXiv:2404.14239)
- A Multimodal Automated Interpretability Agent (arXiv:2404.14394)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning (arXiv:2404.12803)
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (arXiv:2404.15653)
- Editable Image Elements for Controllable Synthesis (arXiv:2404.16029)
- MoDE: CLIP Data Experts via Clustering (arXiv:2404.16030)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension (arXiv:2404.16790)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821)
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs (arXiv:2404.16375)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
- HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections (arXiv:2404.16845)
- BlenderAlchemy: Editing 3D Graphics with Vision-Language Models (arXiv:2404.17672)
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations (arXiv:2404.17521)
- Automatic Creative Selection with Cross-Modal Matching (arXiv:2405.00029)
- What matters when building vision-language models? (arXiv:2405.02246)
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots (arXiv:2405.07990)
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding (arXiv:2405.08344)
- Understanding the performance gap between online and offline alignment algorithms (arXiv:2405.08448)
- SpeechVerse: A Large-scale Generalizable Audio Language Model (arXiv:2405.08295)
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models (arXiv:2405.08317)
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model (arXiv:2405.09215)
- LoRA Learns Less and Forgets Less (arXiv:2405.09673)
- Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
- Toon3D: Seeing Cartoons from a New Perspective (arXiv:2405.10320)
- Octo: An Open-Source Generalist Robot Policy (arXiv:2405.12213)
- Imp: Highly Capable Large Multimodal Models for Mobile Devices (arXiv:2405.12107)
- Your Transformer is Secretly Linear (arXiv:2405.12250)
- Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
- CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers (arXiv:2405.13195)
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574)
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition (arXiv:2405.15216)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- Matryoshka Multimodal Models (arXiv:2405.17430)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (arXiv:2405.17428)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)
- Dense Connector for MLLMs (arXiv:2405.13800)
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation (arXiv:2405.14598)
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (arXiv:2405.20204)
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities (arXiv:2405.18669)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos (arXiv:2405.20340)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (arXiv:2405.21075)
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (arXiv:2406.00888)
- Parrot: Multilingual Visual Instruction Tuning (arXiv:2406.02539)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM (arXiv:2406.02884)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments (arXiv:2406.04151)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (arXiv:2406.01014)
- Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040)
- An Image is Worth 32 Tokens for Reconstruction and Generation (arXiv:2406.07550)
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising (arXiv:2406.06911)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (arXiv:2406.07476)
- What If We Recaption Billions of Web Images with LLaMA-3? (arXiv:2406.08478)
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (arXiv:2406.08407)
- Needle In A Multimodal Haystack (arXiv:2406.07230)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models (arXiv:2406.11839)
- VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816)
- TroL: Traversal of Layers for Large Language and Vision Models (arXiv:2406.12246)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (arXiv:2406.12275)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning (arXiv:2406.12742)
- Adversarial Attacks on Multimodal Agents (arXiv:2406.12814)
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models (arXiv:2406.11230)
- Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models (arXiv:2406.12649)
- Understanding Hallucinations in Diffusion Models through Mode Interpolation (arXiv:2406.09358)
- CMC-Bench: Towards a New Paradigm of Visual Signal Compression (arXiv:2406.09356)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (arXiv:2406.09406)
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (arXiv:2406.09403)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding (arXiv:2406.09411)
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus (arXiv:2406.08707)
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts (arXiv:2406.09162)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (arXiv:2406.08418)
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (arXiv:2406.08451)
- arXiv:2406.04127
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing (arXiv:2406.06523)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (arXiv:2406.08487)
- VCR: Visual Caption Restoration (arXiv:2406.06462)
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv:2406.09246)
- DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552)
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion (arXiv:2406.04338)
- Hibou: A Family of Foundational Vision Transformers for Pathology (arXiv:2406.05074)
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects (arXiv:2406.10210)
- XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning (arXiv:2406.08973)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (arXiv:2406.11833)
- Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models (arXiv:2406.11831)
- From Pixels to Prose: A Large Dataset of Dense Image Captions (arXiv:2406.10328)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs (arXiv:2406.14544)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (arXiv:2406.11069)
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens (arXiv:2406.11271)
- arXiv:2406.11775
- Unifying Multimodal Retrieval via Document Screenshot Embedding (arXiv:2406.11251)
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515)
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models (arXiv:2406.14035)
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (arXiv:2406.14596)
- Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report (arXiv:2406.11403)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv:2406.16338)
- Long Context Transfer from Language to Vision (arXiv:2406.16852)
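Since every entry above carries its arXiv ID, the collection can be resolved programmatically. Below is a minimal sketch, assuming network access, that batch-queries the public arXiv export API (the `id_list` parameter of `export.arxiv.org/api/query`) to recover titles; the sample IDs are taken from the list above, and `fetch_titles` is an illustrative helper, not part of any library.

```python
# Minimal sketch: resolve arXiv IDs from this collection to their titles
# via the public arXiv export API (an Atom feed over HTTP).
# Assumes network access; sample IDs below come from the list above.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed

def fetch_titles(arxiv_ids):
    """Batch-query the arXiv API and yield (id, title) pairs."""
    url = "http://export.arxiv.org/api/query?id_list=" + ",".join(arxiv_ids)
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(ATOM + "entry"):
        # Entry IDs look like "http://arxiv.org/abs/2402.04252v1"; keep the tail.
        entry_id = entry.find(ATOM + "id").text.rsplit("/", 1)[-1]
        # Titles in the feed wrap across lines; collapse the whitespace.
        title = " ".join(entry.find(ATOM + "title").text.split())
        yield entry_id, title

if __name__ == "__main__":
    sample = ["2402.04252", "2403.09611", "2405.09818"]
    for entry_id, title in fetch_titles(sample):
        # Each paper also has a Hugging Face page at huggingface.co/papers/<id>.
        print(f"{entry_id}: {title}")
```

The same ID also addresses the paper's Hugging Face discussion page (https://huggingface.co/papers/2402.04252, for example), so either endpoint can serve as the lookup target.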