Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2412.14835

about 22 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 39
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 20

about 14 hours ago

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Paper • 2406.16860 • Published Jun 24 • 58
Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2 • 21
LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51
Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22 • 123

about 3 hours ago

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24 • 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24 • 53
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 86
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27 • 31

about 11 hours ago

Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published 1 day ago • 49

about 23 hours ago

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Paper • 2410.16153 • Published Oct 21 • 43
AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21 • 58
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Paper • 2410.12787 • Published Oct 16 • 30
LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks

Paper • 2410.01744 • Published Oct 2 • 26

about 4 hours ago

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Paper • 2408.16767 • Published Aug 29 • 30
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Paper • 2411.16657 • Published 25 days ago • 17
Autoregressive Video Generation without Vector Quantization

Paper • 2412.14169 • Published 2 days ago • 11
Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published 1 day ago • 49

about 17 hours ago

Human-like Episodic Memory for Infinite Context LLMs

Paper • 2407.09450 • Published Jul 12 • 59
MUSCLE: A Model Update Strategy for Compatible LLM Evolution

Paper • 2407.09435 • Published Jul 12 • 20
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Paper • 2407.09121 • Published Jul 12 • 5
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Paper • 2407.14482 • Published Jul 19 • 25

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs