UniAudio: An Audio Foundation Model Toward Universal Audio Generation Paper • 2310.00704 • Published Oct 1, 2023 • 19
Structural Similarities Between Language Models and Neural Response Measurements Paper • 2306.01930 • Published Jun 2, 2023 • 2
Streaming Transformer ASR with Blockwise Synchronous Beam Search Paper • 2006.14941 • Published Jun 25, 2020 • 2
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models Paper • 2403.14438 • Published Mar 21, 2024 • 2
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions Paper • 1712.05884 • Published Dec 16, 2017 • 2
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild Paper • 2403.16973 • Published Mar 25, 2024 • 2
Masked Audio Generation using a Single Non-Autoregressive Transformer Paper • 2401.04577 • Published Jan 9, 2024 • 42
WavLLM: Towards Robust and Adaptive Speech Large Language Model Paper • 2404.00656 • Published Mar 31, 2024 • 10
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis Paper • 2404.03204 • Published Apr 4, 2024 • 7
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models Paper • 2311.07919 • Published Nov 14, 2023 • 9
Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion Paper • 2311.14836 • Published Nov 24, 2023 • 2
Audio Dialogues: Dialogues dataset for audio and music understanding Paper • 2404.07616 • Published Apr 11, 2024 • 15
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization Paper • 2404.09956 • Published Apr 15, 2024 • 11
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound Paper • 2405.00233 • Published Apr 30, 2024 • 13
LLM-AD: Large Language Model based Audio Description System Paper • 2405.00983 • Published May 2, 2024 • 16
Images that Sound: Composing Images and Sounds on a Single Canvas Paper • 2405.12221 • Published May 20, 2024 • 1
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published Jun 12, 2024 • 24
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds Paper • 2407.01494 • Published Jul 1, 2024 • 13
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation Paper • 2407.02869 • Published Jul 3, 2024 • 18
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published Jul 4, 2024 • 35
Audio Conditioning for Music Generation via Discrete Bottleneck Features Paper • 2407.12563 • Published Jul 17, 2024 • 5
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation Paper • 2408.03588 • Published Aug 7, 2024 • 6
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands Paper • 2408.11048 • Published Aug 20, 2024 • 4
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling Paper • 2408.16532 • Published Aug 29, 2024 • 47