Better & Faster Large Language Models via Multi-token Prediction Paper • 2404.19737 • Published Apr 30 • 73
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization Paper • 2404.09956 • Published Apr 15 • 11
Audio Dialogues: Dialogues dataset for audio and music understanding Paper • 2404.07616 • Published Apr 11 • 15
Gecko: Versatile Text Embeddings Distilled from Large Language Models Paper • 2403.20327 • Published Mar 29 • 47
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis Paper • 2404.03204 • Published Apr 4 • 7
Improving Text-to-Image Consistency via Automatic Prompt Optimization Paper • 2403.17804 • Published Mar 26 • 16
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Paper • 2403.05530 • Published Mar 8 • 59
Teaching Large Language Models to Reason with Reinforcement Learning Paper • 2403.04642 • Published Mar 7 • 46
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Paper • 2402.17764 • Published Feb 27 • 602
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models Paper • 2403.03100 • Published Mar 5 • 34
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27 • 88
Music Style Transfer with Time-Varying Inversion of Diffusion Models Paper • 2402.13763 • Published Feb 21 • 9
ChatMusician: Understanding and Generating Music Intrinsically with LLM Paper • 2402.16153 • Published Feb 25 • 56
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion Paper • 2402.10009 • Published Feb 15 • 18
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data Paper • 2402.08093 • Published Feb 12 • 54
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like Paper • 2402.07383 • Published Feb 12 • 13