InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language Paper • 2305.05662 • Published May 9, 2023 • 4
Learning Human Motion Representations: A Unified Perspective Paper • 2210.06551 • Published Oct 12, 2022
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks Paper • 2406.08394 • Published Jun 12, 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions Paper • 2407.20962 • Published Jul 30, 2024
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published 30 days ago • 123
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published 30 days ago • 123
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25, 2024 • 55
ControlLLM: Augment Language Models with Tools by Searching on Graphs Paper • 2310.17796 • Published Oct 26, 2023 • 17