Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement Paper • 2411.06558 • Published Nov 10 • 34
SlimLM: An Efficient Small Language Model for On-Device Document Assistance Paper • 2411.09944 • Published about 1 month ago • 12
Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing Paper • 2411.19460 • Published 16 days ago • 10
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published 9 days ago • 40
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment Paper • 2412.04814 • Published 9 days ago • 39
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation Paper • 2412.04445 • Published 10 days ago • 20