WaveGAN

Authors:
Chris Donahue, Julian McAuley, Miller Puckette

 

Description:
WaveGAN is an approach to unsupervised synthesis of raw-waveform audio using generative adversarial networks (GANs). It generates roughly one-second slices of audio (16384 samples at 16 kHz) with global coherence, making it suitable for sound-effect generation.

 

Training and Data:
A generator maps latent noise vectors to audio waveforms while a discriminator (critic) scores them against real recordings; both networks are trained adversarially using the WGAN-GP objective (Wasserstein loss with a gradient penalty).
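The critic's WGAN-GP objective can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: since NumPy has no autograd, the gradient norms at real/fake interpolates are assumed to be precomputed and passed in, and the function name and toy values are hypothetical.

```python
import numpy as np

def wgan_gp_critic_loss(d_real, d_fake, grad_norms, lam=10.0):
    """WGAN-GP critic objective: the critic maximizes E[D(real)] - E[D(fake)],
    so its loss is the negation of that difference, plus a penalty pushing the
    gradient norms (measured at points interpolated between real and fake
    batches) toward 1. lam=10 is the penalty weight used in the WGAN-GP paper."""
    wasserstein = np.mean(d_fake) - np.mean(d_real)
    penalty = lam * np.mean((grad_norms - 1.0) ** 2)
    return wasserstein + penalty

# Toy scores: the critic rates real clips higher than fakes, and the gradient
# norms are already 1, so the penalty term vanishes and the loss is -2.0.
d_real = np.array([0.9, 1.1, 1.0])
d_fake = np.array([-0.8, -1.2, -1.0])
grad_norms = np.array([1.0, 1.0, 1.0])
loss = wgan_gp_critic_loss(d_real, d_fake, grad_norms)
```

In practice the penalty term is what stabilizes training on raw audio, since it keeps the critic 1-Lipschitz without weight clipping.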

 

Advantages:
Fully parallelizable audio generation, efficient training and inference, and ability to generate diverse sound effects without supervision.

 

Limitations:
Potential artifacts in generated audio, limited to one-second slices, and requires further development for longer or more complex audio.

 

Model Architecture:
WaveGAN adapts the DCGAN architecture to audio by flattening its 2D convolutions into 1D (longer filters of length 25 with a larger stride of 4) and adding a phase-shuffle operation to the discriminator.
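Phase shuffle perturbs the phase of each discriminator layer's activations by a small random offset. Below is a minimal NumPy sketch under simplifying assumptions (one offset per call, reflection padding, a toy 1-channel input); the function name is illustrative, not from the paper's code.

```python
import numpy as np

def phase_shuffle(x, n, rng):
    """Shift each channel of x (shape [channels, time]) by a random offset
    drawn from [-n, n], filling the vacated samples by reflection padding.
    This prevents the discriminator from latching onto the periodic phase
    artifacts that transposed convolutions leave in generated audio."""
    shift = rng.integers(-n, n + 1)
    if shift == 0:
        return x.copy()
    padded = np.pad(x, ((0, 0), (n, n)), mode="reflect")
    start = n - shift
    return padded[:, start:start + x.shape[1]]

rng = np.random.default_rng(0)
x = np.arange(8.0).reshape(1, 8)          # 1 channel, 8 time steps
shuffled = phase_shuffle(x, n=2, rng=rng)  # same shape, randomly shifted
```

Note that the output length always matches the input length, so the operation can be dropped between any two discriminator layers.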

 

Dependencies:
Uses WGAN-GP for training stability, relies on transposed (fractionally strided) convolutions for upsampling in the generator, and applies phase shuffle to keep the discriminator from overfitting to periodic artifacts.

 

Synthesis Process:
The generator synthesizes audio by repeatedly applying transposed convolutions with 1D filters, upsampling a low-dimensional latent vector into a high-dimensional (16384-sample) waveform.
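The upsampling step can be illustrated with a naive 1D transposed convolution. This is a toy NumPy sketch with made-up sizes (WaveGAN itself uses filters of length 25 with stride 4, stacked across several layers); the function name is an assumption for illustration.

```python
import numpy as np

def conv1d_transpose(x, kernel, stride):
    """Naive 1D transposed convolution: each input sample scatters a scaled
    copy of the kernel into the output, with copies spaced `stride` samples
    apart. Overlapping copies sum, which is also the source of the periodic
    artifacts that phase shuffle guards against in the discriminator."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = np.zeros(out_len)
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += v * kernel
    return out

x = np.ones(4)        # 4 low-resolution feature samples
kernel = np.ones(8)   # toy learned filter
y = conv1d_transpose(x, kernel, stride=4)  # length (4-1)*4 + 8 = 20
```

Each layer multiplies the sequence length by the stride, so a handful of such layers expands a short latent feature map to a full 16384-sample waveform.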

 

Dataset:
Speech Commands Zero Through Nine (SC09) dataset, along with drum sound effects, bird vocalizations, piano performances, and large-vocabulary speech (TIMIT).

 

Preprocessing:
For spectrogram-based generation, audio is preprocessed with a short-time Fourier transform followed by log-magnitude scaling and normalization.
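A minimal NumPy sketch of that pipeline follows. The frame sizes, epsilon values, and per-spectrogram normalization scheme here are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def log_spectrogram(audio, n_fft=256, hop=128):
    """Hann-windowed short-time Fourier transform, then log-magnitude and
    normalization to zero mean / unit variance -- the shape of preprocessing
    used for spectrogram-based generation in the WaveGAN paper."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrogram
    log_mag = np.log(mag + 1e-6)                    # compress dynamic range
    return (log_mag - log_mag.mean()) / (log_mag.std() + 1e-6)

# One second of a 440 Hz tone at 16 kHz, as a toy input.
audio = np.sin(2 * np.pi * 440 * np.arange(16384) / 16000)
spec = log_spectrogram(audio)   # shape: (frames, frequency bins)
```

Normalizing to zero mean and unit variance puts spectrograms on a scale a GAN can model directly, at the cost of needing phase reconstruction to get audio back.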

 

Evaluation Metrics:
Inception score, nearest-neighbor comparisons, and human judgments of digit-labeling accuracy, sound quality, ease of intelligibility, and speaker diversity.
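The inception score can be computed directly from classifier posteriors. This NumPy sketch assumes the class probabilities are already available (for SC09 they come from a 10-way spoken-digit classifier); the function name is illustrative.

```python
import numpy as np

def inception_score(probs):
    """Inception score from classifier posteriors `probs` (shape [N, classes]):
    exp of the mean KL divergence between each sample's label distribution
    p(y|x) and the marginal p(y). Confident and diverse predictions score
    high, up to the number of classes; uniform predictions score 1."""
    marginal = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + 1e-12) - np.log(marginal + 1e-12)),
                axis=1)
    return float(np.exp(kl.mean()))

# Perfectly confident, uniformly diverse predictions over 10 digits
# achieve the maximum score of 10; uniform predictions score 1.
perfect = np.eye(10)
uniform = np.full((5, 10), 0.1)
```

A high score therefore requires both intelligibility (confident per-sample predictions) and diversity (a spread-out marginal over the ten digits).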

 

Performance:
WaveGAN produced diverse, intelligible spoken digits, with promising results in other domains such as drums and bird vocalizations.

 

Contributions:
Introduces WaveGAN for unsupervised raw-audio synthesis, demonstrates its effectiveness across several audio domains, and highlights remaining challenges and potential solutions.
Link to paper


Last Accessed: 7/18/2024

NSF Award #2346473