Flowing from Words to Pixels: A Framework for Cross-Modality Evolution • Paper • 2412.15213 • Published 1 day ago • 17
No More Adam: Learning Rate Scaling at Initialization is All You Need • Paper • 2412.11768 • Published 5 days ago • 35
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference • Paper • 2412.13663 • Published 3 days ago • 82
Autoregressive Video Generation without Vector Quantization • Paper • 2412.14169 • Published 2 days ago • 11
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers • Paper • 2412.12571 • Published 4 days ago • 7
Byte Latent Transformer: Patches Scale Better Than Tokens • Paper • 2412.09871 • Published 8 days ago • 67
Apollo: An Exploration of Video Understanding in Large Multimodal Models • Paper • 2412.10360 • Published 7 days ago • 126
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation • Paper • 2412.09428 • Published 9 days ago • 7
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning • Paper • 2412.14164 • Published 2 days ago • 1
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions • Paper • 2412.09596 • Published 8 days ago • 89
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation • Paper • 2412.09585 • Published 8 days ago • 10
Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale • Paper • 2412.09548 • Published 8 days ago • 1
Multimodal Latent Language Modeling with Next-Token Diffusion • Paper • 2412.08635 • Published 9 days ago • 38
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation • Paper • 2412.07147 • Published 11 days ago • 5