MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data Paper • 2406.18790 • Published 2 days ago • 18
ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning Paper • 2406.14130 • Published 9 days ago • 10
HelpSteer2: Open-source dataset for training top-performing reward models Paper • 2406.08673 • Published 16 days ago • 14
Stable Diffusion 3 Collection Stable Diffusion 3 and related models for text-to-image and image-to-image • 2 items • Updated 17 days ago • 66
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration Paper • 2406.01014 • Published 26 days ago • 29
Launching the Artificial Analysis Text to Image Leaderboard & Arena Article • 23 days ago • 5
CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models Paper • 2405.13974 • Published May 22 • 7
Searching Priors Makes Text-to-Video Synthesis Better Paper • 2406.03215 • Published 24 days ago • 11
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes Paper • 2406.02897 • Published 24 days ago • 12
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting Paper • 2405.18424 • Published May 28 • 7
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24 • 43
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability Paper • 2405.14129 • Published May 23 • 9
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance Paper • 2405.14677 • Published May 23 • 8
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis Paper • 2405.14224 • Published May 23 • 8
LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models Paper • 2405.14477 • Published May 23 • 15
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation Paper • 2405.14598 • Published May 23 • 11
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models Paper • 2405.09220 • Published May 15 • 23
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model Paper • 2405.09215 • Published May 15 • 14
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots Paper • 2405.07990 • Published May 13 • 15
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts Paper • 2405.07518 • Published May 13 • 22
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models Paper • 2403.13372 • Published Mar 20 • 58
Self-Play Preference Optimization for Language Model Alignment Paper • 2405.00675 • Published May 1 • 19
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge Paper • 2405.00263 • Published May 1 • 13
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound Paper • 2405.00233 • Published Apr 30 • 12
Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting Paper • 2404.19758 • Published Apr 30 • 10
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Paper • 2404.19752 • Published Apr 30 • 20
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation Paper • 2404.19427 • Published Apr 30 • 69
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance Paper • 2401.16465 • Published Jan 29 • 10
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting Paper • 2404.18911 • Published Apr 29 • 29
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models Paper • 2404.17672 • Published Apr 26 • 18
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published Apr 29 • 67
MaPa: Text-driven Photorealistic Material Painting for 3D Shapes Paper • 2404.17569 • Published Apr 26 • 11
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 33
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis Paper • 2404.13686 • Published Apr 21 • 26
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 240
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions Paper • 2404.13208 • Published Apr 19 • 38
CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting Paper • 2404.09458 • Published Apr 15 • 6
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length Paper • 2404.08801 • Published Apr 12 • 62
Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video Paper • 2404.09833 • Published Apr 15 • 29
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents Paper • 2404.05902 • Published Apr 8 • 20
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples Paper • 2404.07544 • Published Apr 11 • 15
Transferable and Principled Efficiency for Open-Vocabulary Segmentation Paper • 2404.07448 • Published Apr 11 • 10