CoMP: Continual Multimodal Pre-training for Vision Foundation Models Paper • 2503.18931 • Published 1 day ago • 16
Long-Context Autoregressive Video Modeling with Next-Frame Prediction Paper • 2503.19325 • Published 1 day ago • 47
Training-free Diffusion Acceleration with Bottleneck Sampling Paper • 2503.18940 • Published 1 day ago • 12
TULIP: Towards Unified Language-Image Pretraining Paper • 2503.15485 • Published 7 days ago • 43
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity Paper • 2503.07677 • Published 16 days ago • 80
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video Paper • 2503.11647 • Published 12 days ago • 119
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation Paper • 2503.09151 • Published 14 days ago • 29
YuE: Scaling Open Foundation Models for Long-Form Music Generation Paper • 2503.08638 • Published 15 days ago • 59
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning Paper • 2503.04812 • Published 22 days ago • 13
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer Paper • 2503.07027 • Published 16 days ago • 26
Token-Efficient Long Video Understanding for Multimodal LLMs Paper • 2503.04130 • Published 20 days ago • 84
UniTok: A Unified Tokenizer for Visual Generation and Understanding Paper • 2502.20321 • Published 27 days ago • 29
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks Paper • 2502.17157 • Published 30 days ago • 51