2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining • Paper • 2501.00958 • Published 12 days ago • 92
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis • Paper • 2412.19723 • Published 17 days ago • 78
OminiControl: Minimal and Universal Control for Diffusion Transformer • Paper • 2411.15098 • Published Nov 22, 2024 • 53
Identity-Preserving Text-to-Video Generation by Frequency Decomposition • Paper • 2411.17440 • Published Nov 26, 2024 • 35
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos • Paper • 2410.23287 • Published Oct 30, 2024 • 19
Can MLLMs Understand the Deep Implication Behind Chinese Images? • Paper • 2410.13854 • Published Oct 17, 2024 • 10
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control • Paper • 2410.13830 • Published Oct 17, 2024 • 24
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens • Paper • 2410.13863 • Published Oct 17, 2024 • 37
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree • Paper • 2410.16268 • Published Oct 21, 2024 • 66
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting • Paper • 2410.17856 • Published Oct 23, 2024 • 49
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models • Paper • 2410.17637 • Published Oct 23, 2024 • 34
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions • Paper • 2406.07502 • Published Jun 11, 2024 • 1