SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation Paper • 2405.18503 • Published May 28 • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation Paper • 2405.20289 • Published May 30 • 10
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes Paper • 2406.02897 • Published Jun 5 • 13
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning Paper • 2406.03344 • Published Jun 5 • 18
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities Paper • 2406.11768 • Published Jun 17 • 20
Towards Robust Speech Representation Learning for Thousands of Languages Paper • 2407.00837 • Published Jun 30 • 10
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds Paper • 2407.01494 • Published Jul 1 • 13
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation Paper • 2407.02869 • Published Jul 3 • 18
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published Jul 4 • 35
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity Paper • 2407.10387 • Published Jul 15 • 6
Audio Conditioning for Music Generation via Discrete Bottleneck Features Paper • 2407.12563 • Published Jul 17 • 5
Efficient Audio Captioning with Encoder-Level Knowledge Distillation Paper • 2407.14329 • Published Jul 19 • 4
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation Paper • 2407.15060 • Published Jul 21 • 9
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent Paper • 2407.21646 • Published Jul 31 • 18
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models Paper • 2408.01337 • Published Aug 2 • 10
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation Paper • 2408.01708 • Published Aug 3 • 3
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation Paper • 2408.03588 • Published Aug 7 • 6
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency Paper • 2408.04708 • Published Aug 8 • 5
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation Paper • 2408.07547 • Published Aug 14 • 7
Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization Paper • 2408.08019 • Published Aug 15 • 9
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling Paper • 2408.16532 • Published Aug 29 • 46
The VoxCeleb Speaker Recognition Challenge: A Retrospective Paper • 2408.14886 • Published Aug 27 • 8
Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders Paper • 2409.00391 • Published Aug 31 • 4
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation Paper • 2409.02245 • Published Sep 3 • 9
LLaMA-Omni: Seamless Speech Interaction with Large Language Models Paper • 2409.06666 • Published Sep 10 • 55
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis Paper • 2409.06135 • Published Sep 10 • 14
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation Paper • 2409.09214 • Published Sep 13 • 46
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer Paper • 2409.10819 • Published Sep 17 • 17
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing Paper • 2409.10831 • Published Sep 17 • 4
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models Paper • 2409.12139 • Published Sep 18 • 11
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer Paper • 2409.08425 • Published Sep 12 • 9
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions Paper • 2409.12962 • Published Sep 19 • 2
Distilling an End-to-End Voice Assistant Without Instruction Training Data Paper • 2410.02678 • Published Oct 3 • 22
Roadmap towards Superhuman Speech Understanding using Large Language Models Paper • 2410.13268 • Published Oct 17 • 33
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization Paper • 2410.12957 • Published Oct 16 • 7
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant Paper • 2410.15316 • Published Oct 20 • 10
Continuous Speech Synthesis using per-token Latent Diffusion Paper • 2410.16048 • Published Oct 21 • 28
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer Paper • 2409.00750 • Published Sep 1 • 2