Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published Oct 3 • 52
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging Paper • 2410.01215 • Published Oct 2 • 30
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25 • 104
Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs Paper • 2409.14988 • Published Sep 23 • 21
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines Paper • 2409.12959 • Published Sep 19 • 36
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers Paper • 2409.04109 • Published Sep 6 • 43
Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise Paper • 2410.03017 • Published Oct 3 • 26
Addition is All You Need for Energy-efficient Language Models Paper • 2410.00907 • Published Oct 1 • 144
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations Paper • 2410.02707 • Published Oct 3 • 47
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References Paper • 2410.05193 • Published Oct 7 • 12
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models Paper • 2411.04905 • Published Nov 7 • 111
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning Paper • 2411.05003 • Published Nov 7 • 70
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion Paper • 2411.04928 • Published Nov 7 • 48
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation Paper • 2411.04999 • Published Nov 7 • 16
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond Paper • 2411.03590 • Published Nov 6 • 9
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level Paper • 2411.03562 • Published Nov 5 • 60
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems Paper • 2411.02959 • Published Nov 5 • 64
Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge Paper • 2411.02657 • Published Nov 4 • 5
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents Paper • 2410.24024 • Published Oct 31 • 48
How Far is Video Generation from World Model: A Physical Law Perspective Paper • 2411.02385 • Published Nov 4 • 33
Survey of Cultural Awareness in Language Models: Text and Beyond Paper • 2411.00860 • Published Oct 30 • 23
Training-free Regional Prompting for Diffusion Transformers Paper • 2411.02395 • Published Nov 4 • 25
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models Paper • 2411.00492 • Published Nov 1 • 6
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 30 • 46
DELTA: Dense Efficient Long-range 3D Tracking for any video Paper • 2410.24211 • Published Oct 31 • 8
Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning Paper • 2410.21845 • Published Oct 29 • 12
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset Paper • 2410.22325 • Published Oct 29 • 10
Animate-X: Universal Character Image Animation with Enhanced Motion Representation Paper • 2410.10306 • Published Oct 14 • 54
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant Paper • 2410.18603 • Published Oct 24 • 31
LongReward: Improving Long-context Large Language Models with AI Feedback Paper • 2410.21252 • Published Oct 28 • 17
Teach Multimodal LLMs to Comprehend Electrocardiographic Images Paper • 2410.19008 • Published Oct 21 • 23
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch Paper • 2410.18693 • Published Oct 24 • 40
WorldSimBench: Towards Video Generation Models as World Simulators Paper • 2410.18072 • Published Oct 23 • 18
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes Paper • 2410.18084 • Published Oct 23 • 13
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes Paper • 2410.17249 • Published Oct 22 • 41
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors Paper • 2410.16271 • Published Oct 21 • 80
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation Paper • 2410.13232 • Published Oct 17 • 40
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures Paper • 2410.13754 • Published Oct 17 • 74
MobA: A Two-Level Agent System for Efficient Mobile Task Automation Paper • 2410.13757 • Published Oct 17 • 31
Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free Paper • 2410.10814 • Published Oct 14 • 48
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts Paper • 2410.10626 • Published Oct 14 • 37
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19 • 47
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use Paper • 2411.10323 • Published Nov 15 • 31
Sharingan: Extract User Action Sequence from Desktop Recordings Paper • 2411.08768 • Published Nov 13 • 10
Hermes: A Large Language Model Framework on the Journey to Autonomous Networks Paper • 2411.06490 • Published Nov 10 • 6
Large Language Models Can Self-Improve in Long-context Reasoning Paper • 2411.08147 • Published Nov 12 • 62
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection Paper • 2411.08868 • Published Nov 13 • 12
GRAPE: Generalizing Robot Policy via Preference Alignment Paper • 2411.19309 • Published 24 days ago • 42
On Domain-Specific Post-Training for Multimodal Large Language Models Paper • 2411.19930 • Published 23 days ago • 24
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving Paper • 2411.15139 • Published about 1 month ago • 15
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published 26 days ago • 76
Star Attention: Efficient LLM Inference over Long Sequences Paper • 2411.17116 • Published 27 days ago • 47
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints Paper • 2412.07760 • Published 12 days ago • 49
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training Paper • 2412.09619 • Published 10 days ago • 20