Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models Paper • 2406.17294 • Published Jun 25 • 10
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27 • 51
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model Paper • 2406.20076 • Published Jun 28 • 8
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation Paper • 2407.02869 • Published Jul 3 • 18
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published Jul 4 • 35
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation Paper • 2407.06135 • Published Jul 8 • 20
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40
SEED-Story: Multimodal Long Story Generation with Large Language Model Paper • 2407.08683 • Published Jul 11 • 22