Good Papers - a steveyin Collection

Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

steveyin 's Collections

object detection

Good Papers

updated 4 days ago

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Paper • 2405.20340 • Published May 30 • 19
Spectrally Pruned Gaussian Fields with Neural Compensation

Paper • 2405.00676 • Published May 1 • 8
Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Paper • 2404.18212 • Published Apr 28 • 27
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Paper • 2405.00732 • Published Apr 29 • 118
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Paper • 2405.08344 • Published May 14 • 12
LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published May 15 • 87
Octo: An Open-Source Generalist Robot Policy

Paper • 2405.12213 • Published May 20 • 24
FIFO-Diffusion: Generating Infinite Videos from Text without Training

Paper • 2405.11473 • Published May 19 • 53
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Paper • 2406.02523 • Published Jun 4 • 10
Towards a Personal Health Large Language Model

Paper • 2406.06474 • Published Jun 10 • 18
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

Paper • 2406.06216 • Published Jun 10 • 19
Vript: A Video Is Worth Thousands of Words

Paper • 2406.06040 • Published Jun 10 • 25
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

Paper • 2406.06469 • Published Jun 10 • 24
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Paper • 2406.16860 • Published Jun 24 • 57
VideoLLM-online: Online Video Large Language Model for Streaming Video

Paper • 2406.11816 • Published Jun 17 • 21
Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26 • 47
Depth Anything V2

Paper • 2406.09414 • Published Jun 13 • 92
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13 • 50
OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13 • 36
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Paper • 2406.09403 • Published Jun 13 • 19
Transformers meet Neural Algorithmic Reasoners

Paper • 2406.09308 • Published Jun 13 • 43
MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Paper • 2406.05338 • Published Jun 8 • 39
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Paper • 2406.07476 • Published Jun 11 • 32
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Paper • 2406.04338 • Published Jun 6 • 34
The Prompt Report: A Systematic Survey of Prompting Techniques

Paper • 2406.06608 • Published Jun 6 • 55
An Image is Worth 32 Tokens for Reconstruction and Generation

Paper • 2406.07550 • Published Jun 11 • 55
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Paper • 2406.07472 • Published Jun 11 • 11
Mixture-of-Agents Enhances Large Language Model Capabilities

Paper • 2406.04692 • Published Jun 7 • 55
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Paper • 2406.01014 • Published Jun 3 • 31
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Paper • 2406.02430 • Published Jun 4 • 30
Agentless: Demystifying LLM-based Software Engineering Agents

Paper • 2407.01489 • Published Jul 1 • 42
Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2 • 21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3 • 92
TokenPacker: Efficient Visual Projector for Multimodal LLM

Paper • 2407.02392 • Published Jul 2 • 21
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Paper • 2407.02869 • Published Jul 3 • 18
Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Paper • 2407.04620 • Published Jul 5 • 27
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Paper • 2407.07061 • Published Jul 9 • 26
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Paper • 2407.06938 • Published Jul 9 • 21
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Paper • 2406.18009 • Published Jun 26 • 19
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

Paper • 2406.19741 • Published Jun 28 • 59
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23 • 25
Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Paper • 2406.10539 • Published Jun 15 • 1
Cross Anything: General Quadruped Robot Navigation through Complex Terrains

Paper • 2407.16412 • Published Jul 23 • 5
A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data

Paper • 2407.16680 • Published Jul 23 • 11
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Paper • 2407.14931 • Published Jul 20 • 20
EVLM: An Efficient Vision-Language Model for Visual Understanding

Paper • 2407.14177 • Published Jul 19 • 42
The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Paper • 2407.14402 • Published Jul 19 • 13
Internal Consistency and Self-Feedback in Large Language Models: A Survey

Paper • 2407.14507 • Published Jul 19 • 46
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Paper • 2407.13833 • Published Jul 18 • 11
3D Gaussian Editing with A Single Image

Paper • 2408.07540 • Published Aug 14 • 10
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17 • 21
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Paper • 2408.10195 • Published Aug 19 • 12
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

Paper • 2408.10161 • Published Aug 19 • 13
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

Paper • 2408.07931 • Published Aug 15 • 19
Automated Design of Agentic Systems

Paper • 2408.08435 • Published Aug 15 • 38
Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22 • 119
gsplat: An Open-Source Library for Gaussian Splatting

Paper • 2409.06765 • Published Sep 10 • 13
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10 • 55
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers

Paper • 2409.04196 • Published Sep 6 • 11
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Paper • 2408.16725 • Published Aug 29 • 52
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

Paper • 2408.16768 • Published Aug 29 • 26
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Paper • 2408.04567 • Published Aug 8 • 24
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Paper • 2409.11564 • Published Sep 17 • 19
The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends

Paper • 2409.14195 • Published Sep 21 • 11
Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Paper • 2409.18121 • Published Sep 26 • 7
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Paper • 2409.17280 • Published Sep 25 • 9
Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study

Paper • 2409.17580 • Published Sep 26 • 7
LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks

Paper • 2410.01744 • Published Oct 2 • 25
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Paper • 2410.00531 • Published Oct 1 • 29
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Paper • 2410.02073 • Published Oct 2 • 40
FAN: Fourier Analysis Networks

Paper • 2410.02675 • Published Oct 3 • 24
Differential Transformer

Paper • 2410.05258 • Published Oct 7 • 166
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Paper • 2410.12787 • Published Oct 16 • 30
Revealing the Barriers of Language Agents in Planning

Paper • 2410.12409 • Published Oct 16 • 23
What Matters in Transformers? Not All Attention is Needed

Paper • 2406.15786 • Published Jun 22 • 29
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation

Paper • 2410.09704 • Published Oct 13 • 11
Benchmarking Agentic Workflow Generation

Paper • 2410.07869 • Published Oct 10 • 25
Agent S: An Open Agentic Framework that Uses Computers Like a Human

Paper • 2410.08164 • Published Oct 10 • 24
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Paper • 2409.16299 • Published Sep 9 • 9
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes

Paper • 2410.18084 • Published Oct 23 • 12
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Paper • 2410.13924 • Published Oct 17 • 6
LLM-based Optimization of Compound AI Systems: A Survey

Paper • 2410.16392 • Published Oct 21 • 13
Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21 • 17
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Paper • 2410.11190 • Published Oct 15 • 20
Unbounded: A Generative Infinite Game of Character Life Simulation

Paper • 2410.18975 • Published Oct 24 • 34
WAFFLE: Multi-Modal Model for Automated Front-End Development

Paper • 2410.18362 • Published Oct 24 • 11
A Survey of Small Language Models

Paper • 2410.20011 • Published Oct 25 • 38
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant

Paper • 2410.18603 • Published Oct 24 • 30
GPT-4o System Card

Paper • 2410.21276 • Published Oct 25 • 79
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

Paper • 2410.23743 • Published 26 days ago • 59
AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21 • 57
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Paper • 2410.16271 • Published Oct 21 • 80

Collection guide
Browse collections

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs