- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 43
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22
Collections
Collections including paper arxiv:2501.06186
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 61
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
  Paper • 2501.01904 • Published • 32
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
  Paper • 2501.12599 • Published • 103
- Visual-RFT: Visual Reinforcement Fine-Tuning
  Paper • 2503.01785 • Published • 64
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 61
- apple/OpenELM
  Updated • 1.43k
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  Text Generation • Updated • 1.59M • 1.03k
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Paper • 2501.12948 • Published • 346
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 61
- Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
  Paper • 2501.02393 • Published • 8
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
  Paper • 2501.01904 • Published • 32
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
  Paper • 2412.14711 • Published • 16
- An Empirical Study of Autoregressive Pre-training from Videos
  Paper • 2501.05453 • Published • 37
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 61
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
  Paper • 2501.05707 • Published • 20
- The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
  Paper • 2502.17535 • Published • 8
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 50
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 61
- Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
  Paper • 2501.09012 • Published • 10
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 100
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
  Paper • 2412.19723 • Published • 82
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Paper • 2412.04467 • Published • 107
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  Paper • 2412.03555 • Published • 129