audio - a zzfive Collection

updated 12 days ago

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

Paper • 2405.18503 • Published May 28 • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Paper • 2405.20289 • Published May 30 • 10
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Paper • 2406.02897 • Published Jun 5 • 13
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

Paper • 2406.03344 • Published Jun 5 • 18
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Paper • 2406.11768 • Published Jun 17 • 20
Towards Robust Speech Representation Learning for Thousands of Languages

Paper • 2407.00837 • Published Jun 30 • 10
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Paper • 2407.01494 • Published Jul 1 • 13
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Paper • 2407.02869 • Published Jul 3 • 18
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Paper • 2407.04051 • Published Jul 4 • 35
Video-to-Audio Generation with Hidden Alignment

Paper • 2407.07464 • Published Jul 10 • 16
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Paper • 2407.10387 • Published Jul 15 • 6
Qwen2-Audio Technical Report

Paper • 2407.10759 • Published Jul 15 • 55
Audio Conditioning for Music Generation via Discrete Bottleneck Features

Paper • 2407.12563 • Published Jul 17 • 5
Stable Audio Open

Paper • 2407.14358 • Published Jul 19 • 23
Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Paper • 2407.14329 • Published Jul 19 • 4
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Paper • 2407.15060 • Published Jul 21 • 9
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

Paper • 2407.21646 • Published Jul 31 • 18
Open-Vocabulary Audio-Visual Semantic Segmentation

Paper • 2407.21721 • Published Jul 31 • 8
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Paper • 2408.01337 • Published Aug 2 • 10
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Paper • 2408.01708 • Published Aug 3 • 3
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Paper • 2408.03588 • Published Aug 7 • 6
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Paper • 2408.04708 • Published Aug 8 • 5
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Paper • 2408.07547 • Published Aug 14 • 7
Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Paper • 2408.08019 • Published Aug 15 • 9
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29 • 46
The VoxCeleb Speaker Recognition Challenge: A Retrospective

Paper • 2408.14886 • Published Aug 27 • 8
FLUX that Plays Music

Paper • 2409.00587 • Published Sep 1 • 31
Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

Paper • 2409.00391 • Published Aug 31 • 4
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Paper • 2409.02245 • Published Sep 3 • 9
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10 • 55
SongCreator: Lyrics-based Universal Song Generation

Paper • 2409.06029 • Published Sep 9 • 20
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Paper • 2409.06135 • Published Sep 10 • 14
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Paper • 2409.09214 • Published Sep 13 • 46
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Paper • 2409.10819 • Published Sep 17 • 17
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Paper • 2409.10831 • Published Sep 17 • 4
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Paper • 2409.12139 • Published Sep 18 • 11
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Paper • 2409.08425 • Published Sep 12 • 9
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Paper • 2409.12962 • Published Sep 19 • 2
MuCodec: Ultra Low-Bitrate Music Codec

Paper • 2409.13216 • Published Sep 20 • 22
Temporally Aligned Audio for Video with Autoregression

Paper • 2409.13689 • Published Sep 20 • 7
Distilling an End-to-End Voice Assistant Without Instruction Training Data

Paper • 2410.02678 • Published Oct 3 • 22
Roadmap towards Superhuman Speech Understanding using Large Language Models

Paper • 2410.13268 • Published 24 days ago • 33
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Paper • 2410.12957 • Published 25 days ago • 7
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Paper • 2410.15316 • Published 21 days ago • 10
Continuous Speech Synthesis using per-token Latent Diffusion

Paper • 2410.16048 • Published 20 days ago • 28
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Paper • 2409.00750 • Published Sep 1 • 2