VideoGUI: A Benchmark for GUI Automation from Instructional Videos Paper • 2406.10227 • Published 15 days ago • 8
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality Paper • 2406.08845 • Published 16 days ago • 8
Designing a Dashboard for Transparency and Control of Conversational AI Paper • 2406.07882 • Published 17 days ago • 9
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning Paper • 2406.08973 • Published 16 days ago • 85
Make It Count: Text-to-Image Generation with an Accurate Number of Objects Paper • 2406.10210 • Published 15 days ago • 74
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published 18 days ago • 30
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion Paper • 2406.04338 • Published 23 days ago • 32
iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper • 2405.15223 • Published May 24 • 11
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving Paper • 2404.16771 • Published Apr 25 • 16
Interactive3D: Create What You Want by Interactive 3D Generation Paper • 2404.16510 • Published Apr 25 • 18
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models Paper • 2404.14507 • Published Apr 22 • 21
SnapKV: LLM Knows What You are Looking for Before Generation Paper • 2404.14469 • Published Apr 22 • 23
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework Paper • 2404.14619 • Published Apr 22 • 124
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study Paper • 2404.14047 • Published Apr 22 • 38
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 240
HF-curated models available on Workers AI Collection A collection of models curated with Hugging Face that can be run on Cloudflare's Workers AI serverless inference platform. • 15 items • Updated Apr 2 • 50
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks Paper • 2403.14468 • Published Mar 21 • 18
IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models Paper • 2403.13535 • Published Mar 20 • 20
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation Paper • 2403.12906 • Published Mar 19 • 4
LightIt: Illumination Modeling and Control for Diffusion Models Paper • 2403.10615 • Published Mar 15 • 15
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations Paper • 2403.09704 • Published Mar 8 • 29
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis Paper • 2403.08764 • Published Mar 13 • 34
Gemma: Open Models Based on Gemini Research and Technology Paper • 2403.08295 • Published Mar 13 • 44
DragAnything: Motion Control for Anything using Entity Representation Paper • 2403.07420 • Published Mar 12 • 12
VideoMamba: State Space Model for Efficient Video Understanding Paper • 2403.06977 • Published Mar 11 • 24
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Paper • 2403.05530 • Published Mar 8 • 51
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation Paper • 2402.17245 • Published Feb 27 • 10
Sora Generates Videos with Stunning Geometrical Consistency Paper • 2402.17403 • Published Feb 27 • 15
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27 • 88
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions Paper • 2402.17485 • Published Feb 27 • 184
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT Paper • 2402.16840 • Published Feb 26 • 23
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs Paper • 2402.15491 • Published Feb 23 • 13
Seamless Human Motion Composition with Blended Positional Encodings Paper • 2402.15509 • Published Feb 23 • 13
Divide-or-Conquer? Which Part Should You Distill Your LLM? Paper • 2402.15000 • Published Feb 22 • 22
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition Paper • 2402.15220 • Published Feb 23 • 18
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases Paper • 2402.14905 • Published Feb 22 • 81
GaussianPro: 3D Gaussian Splatting with Progressive Propagation Paper • 2402.14650 • Published Feb 22 • 6
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22 • 17
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping Paper • 2402.14083 • Published Feb 21 • 43
Music Style Transfer with Time-Varying Inversion of Diffusion Models Paper • 2402.13763 • Published Feb 21 • 9
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper • 2402.13753 • Published Feb 21 • 106
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction Paper • 2402.12712 • Published Feb 20 • 14
LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing Paper • 2402.10294 • Published Feb 15 • 20
GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting Paper • 2402.10259 • Published Feb 15 • 13
Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling Paper • 2402.10211 • Published Feb 15 • 8
Data Engineering for Scaling Language Models to 128K Context Paper • 2402.10171 • Published Feb 15 • 18