Flowing from Words to Pixels: A Framework for Cross-Modality Evolution Paper • 2412.15213 • Published 1 day ago • 17
AnimateAnything: Consistent and Controllable Animation for Video Generation Paper • 2411.10836 • Published Nov 16 • 23
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP Paper • 2308.02487 • Published Aug 4, 2023 • 12
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models Paper • 2406.09416 • Published Jun 13 • 27
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression Paper • 2406.20092 • Published Jun 28
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens Paper • 2410.13863 • Published Oct 17 • 36
ViTamin Family Collection Designing Scalable Vision Models in the Vision-language Era. The best-performing model is 'jienengchen/ViTamin-XL-384px'. • 16 items • Updated Apr 11 • 8
An Image is Worth 32 Tokens for Reconstruction and Generation Paper • 2406.07550 • Published Jun 11 • 55