-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 86 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31
Collections including paper arxiv:2412.10360
-
Scalable Diffusion Models with Transformers
Paper • 2212.09748 • Published • 17 -
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Paper • 2311.15127 • Published • 12 -
Learning Transferable Visual Models From Natural Language Supervision
Paper • 2103.00020 • Published • 11 -
U-Net: Convolutional Networks for Biomedical Image Segmentation
Paper • 1505.04597 • Published • 8
-
Qwen Technical Report
Paper • 2309.16609 • Published • 35 -
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 7 -
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Paper • 2311.07919 • Published • 9 -
Qwen-VL-Plus
-
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Paper • 2403.09626 • Published • 13 -
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 32 -
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Paper • 2403.13501 • Published • 9 -
LITA: Language Instructed Temporal-Localization Assistant
Paper • 2403.19046 • Published • 18
-
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19 -
ImageBind: One Embedding Space To Bind Them All
Paper • 2305.05665 • Published • 4 -
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 181 -
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3
-
Video as the New Language for Real-World Decision Making
Paper • 2402.17139 • Published • 18 -
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Paper • 2310.19512 • Published • 15 -
VideoMamba: State Space Model for Efficient Video Understanding
Paper • 2403.06977 • Published • 27 -
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Paper • 2401.09047 • Published • 13
-
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Paper • 2312.08578 • Published • 16 -
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Paper • 2312.08583 • Published • 9 -
Vision-Language Models as a Source of Rewards
Paper • 2312.09187 • Published • 11 -
StemGen: A music generation model that listens
Paper • 2312.08723 • Published • 47