Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published 7 days ago • 48
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Paper • 2402.19479 • Published Feb 29 • 30
Towards A Better Metric for Text-to-Video Generation Paper • 2401.07781 • Published Jan 15 • 13
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action Paper • 2312.17172 • Published Dec 28, 2023 • 25
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos Paper • 2312.15770 • Published Dec 25, 2023 • 12
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models Paper • 2312.09608 • Published Dec 15, 2023 • 13
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions Paper • 2312.08578 • Published Dec 14, 2023 • 15
CCM: Adding Conditional Controls to Text-to-Image Consistency Models Paper • 2312.06971 • Published Dec 12, 2023 • 10
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning Paper • 2311.12631 • Published Nov 21, 2023 • 12
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis Paper • 2310.00426 • Published Sep 30, 2023 • 60
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation Paper • 2309.15818 • Published Sep 27, 2023 • 18
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack Paper • 2309.15807 • Published Sep 27, 2023 • 30
Bootstrapping Objectness from Videos by Relaxed Common Fate and Visual Grouping Paper • 2304.08025 • Published Apr 17, 2023 • 2
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models Paper • 2305.13655 • Published May 23, 2023 • 6