Authors:
Chris Donahue, Julian McAuley, Miller Puckette
Description:
WaveGAN is an approach to unsupervised synthesis of raw-waveform audio using generative adversarial networks (GANs), generating one-second audio slices with global coherence, suitable for sound effect generation.
Training and Data:
GAN training pits a generator, which synthesizes audio from latent noise, against a discriminator (critic) that scores generated samples relative to real ones; both networks are optimized with the WGAN-GP objective (Wasserstein loss with a gradient penalty) for training stability.
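As a concrete illustration, here is a minimal PyTorch sketch of the WGAN-GP objective (the original implementation is in TensorFlow; the names critic and generator_loss and the penalty weight LAMBDA = 10 follow the WGAN-GP paper's conventions, not code from the WaveGAN repository):

    import torch

    LAMBDA = 10.0  # gradient penalty weight from the WGAN-GP paper

    def gradient_penalty(critic, real, fake):
        # Interpolate between real and fake waveforms, shape (batch, 1, num_samples)
        eps = torch.rand(real.size(0), 1, 1, device=real.device)
        interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grads, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
        # Penalize deviation of the per-example gradient norm from 1
        return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    def critic_loss(critic, real, fake):
        # Wasserstein critic loss plus the gradient penalty term
        return (critic(fake).mean() - critic(real).mean()
                + LAMBDA * gradient_penalty(critic, real, fake))

    def generator_loss(critic, fake):
        # The generator maximizes the critic's score on generated samples
        return -critic(fake).mean()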
Advantages:
Fully parallelizable audio generation (all output samples are produced at once rather than autoregressively), efficient training and inference, and the ability to generate diverse sound effects without supervision.
Limitations:
Generated audio can contain artifacts, output is limited to one-second slices, and longer or more structurally complex audio requires further development.
Model Architecture:
WaveGAN adapts the DCGAN architecture to audio by flattening its 2D convolutions into 1D convolutions with longer filters (length 25) and larger stride (4), and by adding a phase shuffle operation in the discriminator.
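A sketch of the phase shuffle operation as described in the paper: each activation map in the discriminator is shifted by a random offset of -n to n samples (the paper found n = 2 effective on SC09), with reflection padding at the boundaries. This PyTorch module is illustrative, not the authors' TensorFlow code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PhaseShuffle(nn.Module):
        # Randomly shift activations by -n..n samples, reflection-padding the edges.
        def __init__(self, n=2):
            super().__init__()
            self.n = n

        def forward(self, x):  # x: (batch, channels, time)
            shift = int(torch.randint(-self.n, self.n + 1, (1,)))
            if shift == 0:
                return x
            if shift > 0:
                # Pad on the left, drop excess samples on the right
                return F.pad(x, (shift, 0), mode="reflect")[..., :x.size(2)]
            # Pad on the right, drop excess samples on the left
            return F.pad(x, (0, -shift), mode="reflect")[..., -x.size(2):]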
Dependencies:
Relies on WGAN-GP for training stability, transposed convolutions for upsampling in the generator, and phase shuffle to keep the discriminator from overfitting to the periodic artifacts that transposed convolutions introduce.
Synthesis Process:
Synthesis upsamples a low-dimensional latent vector (drawn from a uniform distribution) through a stack of 1D transposed convolutions into a 16384-sample waveform, roughly one second of audio at 16 kHz.
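A hedged PyTorch sketch of this upsampling stack, following the layer sizes described in the paper (100-d uniform latent, five transposed-convolution layers with length-25 filters and stride 4, model size parameter d = 64); the padding and output_padding values here are chosen to give exact 4x upsampling and are an implementation assumption:

    import torch
    import torch.nn as nn

    class WaveGANGenerator(nn.Module):
        # Upsample a 100-d latent vector to a 16384-sample (~1 s at 16 kHz) waveform.
        def __init__(self, d=64):
            super().__init__()
            self.d = d
            self.fc = nn.Linear(100, 16 * 16 * d)  # dense layer, reshaped to (16*d, 16)
            def up(c_in, c_out):  # 4x temporal upsampling via 1D transposed convolution
                return nn.ConvTranspose1d(c_in, c_out, kernel_size=25, stride=4,
                                          padding=11, output_padding=1)
            self.net = nn.Sequential(
                nn.ReLU(), up(16 * d, 8 * d),  # 16 -> 64 samples
                nn.ReLU(), up(8 * d, 4 * d),   # 64 -> 256
                nn.ReLU(), up(4 * d, 2 * d),   # 256 -> 1024
                nn.ReLU(), up(2 * d, d),       # 1024 -> 4096
                nn.ReLU(), up(d, 1),           # 4096 -> 16384
                nn.Tanh(),                     # waveform values in [-1, 1]
            )

        def forward(self, z):  # z: (batch, 100), e.g. drawn from Uniform(-1, 1)
            x = self.fc(z).view(-1, 16 * self.d, 16)
            return self.net(x)  # (batch, 1, 16384)

    # waveform = WaveGANGenerator()(torch.rand(8, 100) * 2 - 1)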
Dataset:
Speech Commands Zero Through Nine (SC09), along with other datasets: drum sound effects, bird vocalizations, piano, and large-vocabulary speech (TIMIT).
Preprocessing:
For the spectrogram-based baseline (SpecGAN), audio is transformed with the short-time Fourier transform, converted to log-magnitude spectrograms, and normalized per frequency bin; WaveGAN itself operates directly on raw waveforms.
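A sketch of that preprocessing pipeline (the 16 ms window, 8 ms hop, per-bin normalization, 3-sigma clipping, and rescaling to [-1, 1] follow the paper's description of its spectrogram baseline; for brevity this version estimates normalization statistics per example rather than over the whole training set, which is an assumption):

    import numpy as np
    import librosa

    def preprocess_spectrogram(wav, sr=16000):
        # STFT with 16 ms windows (256 samples) and 8 ms hop (128 samples) at 16 kHz
        spec = librosa.stft(wav, n_fft=256, hop_length=128)
        logmag = np.log(np.abs(spec) + 1e-6)
        # Normalize each frequency bin to zero mean, unit variance over time
        # (the paper computes these statistics over the entire training set)
        mu = logmag.mean(axis=1, keepdims=True)
        sigma = logmag.std(axis=1, keepdims=True) + 1e-6
        norm = (logmag - mu) / sigma
        # Clip to 3 standard deviations and rescale to [-1, 1]
        return np.clip(norm, -3.0, 3.0) / 3.0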
Evaluation Metrics:
Inception score (computed with a classifier pretrained on SC09 digits), nearest-neighbor comparisons to detect memorization, and human evaluation of labeling accuracy, sound quality, ease of intelligibility, and speaker diversity.
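The inception score itself is the exponentiated mean KL divergence between each sample's predicted label distribution and the marginal label distribution; a generic NumPy sketch, where probs is assumed to be an (N, 10) array of softmax outputs from a digit classifier pretrained on SC09:

    import numpy as np

    def inception_score(probs, eps=1e-12):
        # probs: (N, num_classes) array of p(y|x) for N generated samples
        p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution p(y)
        kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
        return float(np.exp(kl.mean()))  # exp(E_x[KL(p(y|x) || p(y))])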
Performance:
WaveGAN produced diverse and intelligible spoken digits on SC09, with promising qualitative results in other domains such as drums and bird vocalizations.
Contributions:
Introduces WaveGAN for unsupervised synthesis of raw-waveform audio, demonstrates its effectiveness across several audio domains, and highlights remaining challenges and potential solutions.