Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published about 13 hours ago • 4
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published 7 days ago • 25
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs Paper • 2411.14199 • Published about 16 hours ago • 10
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper • 2411.14432 • Published about 12 hours ago • 6
Vision/multimodal Models Collection Collection of the most popular vision models including Llama 3.2, LlaVa, Qwen2 VL, Pixtral, PaliGemma and more! • 22 items • Updated about 21 hours ago • 3
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published 2 days ago • 14
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration Paper • 2411.10958 • Published 5 days ago • 38
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Paper • 2411.10161 • Published 7 days ago • 6
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published 3 days ago • 40
AnimateAnything: Consistent and Controllable Animation for Video Generation Paper • 2411.10836 • Published 6 days ago • 18
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Paper • 2411.10640 • Published 6 days ago • 38
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement Paper • 2411.06558 • Published 12 days ago • 29
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use Paper • 2411.10323 • Published 7 days ago • 26
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published 7 days ago • 89
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation Paper • 2411.08033 • Published 10 days ago • 21
Thinking LLMs: General Instruction Following with Thought Generation Paper • 2410.10630 • Published Oct 14 • 16
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Paper • 2411.09595 • Published 8 days ago • 65
MagicQuill: An Intelligent Interactive Image Editing System Paper • 2411.09703 • Published 8 days ago • 50
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination Paper • 2411.03823 • Published 16 days ago • 43