Maya: An Instruction Finetuned Multilingual Multimodal Model Paper • 2412.07112 • Published 24 days ago • 25
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published 21 days ago • 135
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper • 2412.03555 • Published 30 days ago • 119
FreeInit: Bridging Initialization Gap in Video Diffusion Models Paper • 2312.07537 • Published Dec 12, 2023 • 25
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 48
Möbius Transform for Mitigating Perspective Distortions in Representation Learning Paper • 2405.02296 • Published Mar 7, 2024 • 4
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields Paper • 2404.01300 • Published Apr 1, 2024 • 4
DriveLM: Driving with Graph Visual Question Answering Paper • 2312.14150 • Published Dec 21, 2023 • 4
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20, 2024 • 58
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering Paper • 2408.09174 • Published Aug 17, 2024 • 51
Meltemi: The first open Large Language Model for Greek Paper • 2407.20743 • Published Jul 30, 2024 • 67
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey Paper • 2407.21794 • Published Jul 31, 2024 • 5
Gemma 2: Improving Open Language Models at a Practical Size Paper • 2408.00118 • Published Jul 31, 2024 • 75
SpaceVLMs Collection Features VLMs fine-tuned for enhanced spatial reasoning using a synthetic data pipeline similar to Spatial VLM. • 3 items • Updated Jul 26, 2024 • 1
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding Paper • 2407.12594 • Published Jul 17, 2024 • 19