NVILA: Efficient Frontier Visual Language Models • arXiv:2412.04468 • Published Dec 2024 • 57 upvotes
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset • arXiv:2410.22325 • Published Oct 29, 2024 • 10 upvotes
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree • arXiv:2410.16268 • Published Oct 21, 2024 • 66 upvotes
PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation • arXiv:2410.01680 • Published Oct 2, 2024 • 32 upvotes
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? • arXiv:2408.13257 • Published Aug 23, 2024 • 25 upvotes
Building and better understanding vision-language models: insights and future directions • arXiv:2408.12637 • Published Aug 22, 2024 • 124 upvotes
LongVILA: Scaling Long-Context Visual Language Models for Long Videos • arXiv:2408.10188 • Published Aug 19, 2024 • 51 upvotes
VideoGameBunny: Towards vision assistants for video games • arXiv:2407.15295 • Published Jul 21, 2024 • 22 upvotes
Shape of Motion: 4D Reconstruction from a Single Video • arXiv:2407.13764 • Published Jul 18, 2024 • 19 upvotes
OpenVLA: An Open-Source Vision-Language-Action Model • arXiv:2406.09246 • Published Jun 13, 2024 • 36 upvotes