Authors:
Chris Donahue, Julian McAuley, Miller Puckette
Description:
WaveGAN is an approach to unsupervised synthesis of raw-waveform audio using generative adversarial networks (GANs), generating one-second audio slices with global coherence, suitable for sound effect generation.
Training and Data:
GAN training pits a generator, which synthesizes audio from latent noise, against a discriminator (critic) that scores generated samples relative to real ones; both networks are optimized with the WGAN-GP objective (Wasserstein loss with a gradient penalty) for training stability.
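As a concrete illustration, here is a minimal PyTorch sketch of the WGAN-GP objective (the original implementation is in TensorFlow; the names critic and generator_loss and the penalty weight LAMBDA = 10 follow the WGAN-GP paper's conventions, not code from the WaveGAN repository):

    import torch

    LAMBDA = 10.0  # gradient penalty weight from the WGAN-GP paper

    def gradient_penalty(critic, real, fake):
        # Interpolate between real and fake waveforms, shape (batch, 1, num_samples)
        eps = torch.rand(real.size(0), 1, 1, device=real.device)
        interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grads, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
        # Penalize deviation of the per-example gradient norm from 1
        return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    def critic_loss(critic, real, fake):
        # Wasserstein critic loss plus the gradient penalty term
        return (critic(fake).mean() - critic(real).mean()
                + LAMBDA * gradient_penalty(critic, real, fake))

    def generator_loss(critic, fake):
        # The generator maximizes the critic's score on generated samples
        return -critic(fake).mean()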
Advantages:
Fully parallelizable audio generation (all output samples are produced at once rather than autoregressively), efficient training and inference, and the ability to generate diverse sound effects without supervision.
Limitations:
Generated audio can contain artifacts, output is limited to one-second slices, and longer or more structurally complex audio requires further development.
Model Architecture:
WaveGAN adapts the DCGAN architecture to audio by flattening its 2D convolutions into 1D convolutions with longer filters (length 25) and larger stride (4), and by adding a phase shuffle operation in the discriminator.
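A sketch of the phase shuffle operation as described in the paper: each activation map in the discriminator is shifted by a random offset of -n to n samples (the paper found n = 2 effective on SC09), with reflection padding at the boundaries. This PyTorch module is illustrative, not the authors' TensorFlow code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PhaseShuffle(nn.Module):
        # Randomly shift activations by -n..n samples, reflection-padding the edges.
        def __init__(self, n=2):
            super().__init__()
            self.n = n

        def forward(self, x):  # x: (batch, channels, time)
            shift = int(torch.randint(-self.n, self.n + 1, (1,)))
            if shift == 0:
                return x
            if shift > 0:
                # Pad on the left, drop excess samples on the right
                return F.pad(x, (shift, 0), mode="reflect")[..., :x.size(2)]
            # Pad on the right, drop excess samples on the left
            return F.pad(x, (0, -shift), mode="reflect")[..., -x.size(2):]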
Dependencies:
Relies on WGAN-GP for training stability, transposed convolutions for upsampling in the generator, and phase shuffle to keep the discriminator from overfitting to the periodic artifacts that transposed convolutions introduce.
Synthesis Process:
Synthesis upsamples a low-dimensional latent vector (drawn from a uniform distribution) through a stack of 1D transposed convolutions into a 16384-sample waveform, roughly one second of audio at 16 kHz.
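A hedged PyTorch sketch of this upsampling stack, following the layer sizes described in the paper (100-d uniform latent, five transposed-convolution layers with length-25 filters and stride 4, model size parameter d = 64); the padding and output_padding values here are chosen to give exact 4x upsampling and are an implementation assumption:

    import torch
    import torch.nn as nn

    class WaveGANGenerator(nn.Module):
        # Upsample a 100-d latent vector to a 16384-sample (~1 s at 16 kHz) waveform.
        def __init__(self, d=64):
            super().__init__()
            self.d = d
            self.fc = nn.Linear(100, 16 * 16 * d)  # dense layer, reshaped to (16*d, 16)
            def up(c_in, c_out):  # 4x temporal upsampling via 1D transposed convolution
                return nn.ConvTranspose1d(c_in, c_out, kernel_size=25, stride=4,
                                          padding=11, output_padding=1)
            self.net = nn.Sequential(
                nn.ReLU(), up(16 * d, 8 * d),  # 16 -> 64 samples
                nn.ReLU(), up(8 * d, 4 * d),   # 64 -> 256
                nn.ReLU(), up(4 * d, 2 * d),   # 256 -> 1024
                nn.ReLU(), up(2 * d, d),       # 1024 -> 4096
                nn.ReLU(), up(d, 1),           # 4096 -> 16384
                nn.Tanh(),                     # waveform values in [-1, 1]
            )

        def forward(self, z):  # z: (batch, 100), e.g. drawn from Uniform(-1, 1)
            x = self.fc(z).view(-1, 16 * self.d, 16)
            return self.net(x)  # (batch, 1, 16384)

    # waveform = WaveGANGenerator()(torch.rand(8, 100) * 2 - 1)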
Dataset:
Speech Commands Zero Through Nine (SC09), along with other datasets: drum sound effects, bird vocalizations, piano, and large-vocabulary speech (TIMIT).
Preprocessing:
For the spectrogram-based baseline (SpecGAN), audio is transformed with the short-time Fourier transform, converted to log-magnitude spectrograms, and normalized per frequency bin; WaveGAN itself operates directly on raw waveforms.
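A sketch of that preprocessing pipeline (the 16 ms window, 8 ms hop, per-bin normalization, 3-sigma clipping, and rescaling to [-1, 1] follow the paper's description of its spectrogram baseline; for brevity this version estimates normalization statistics per example rather than over the whole training set, which is an assumption):

    import numpy as np
    import librosa

    def preprocess_spectrogram(wav, sr=16000):
        # STFT with 16 ms windows (256 samples) and 8 ms hop (128 samples) at 16 kHz
        spec = librosa.stft(wav, n_fft=256, hop_length=128)
        logmag = np.log(np.abs(spec) + 1e-6)
        # Normalize each frequency bin to zero mean, unit variance over time
        # (the paper computes these statistics over the entire training set)
        mu = logmag.mean(axis=1, keepdims=True)
        sigma = logmag.std(axis=1, keepdims=True) + 1e-6
        norm = (logmag - mu) / sigma
        # Clip to 3 standard deviations and rescale to [-1, 1]
        return np.clip(norm, -3.0, 3.0) / 3.0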
Evaluation Metrics:
Inception score (computed with a classifier pretrained on SC09 digits), nearest-neighbor comparisons to detect memorization, and human evaluation of labeling accuracy, sound quality, ease of intelligibility, and speaker diversity.
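The inception score itself is the exponentiated mean KL divergence between each sample's predicted label distribution and the marginal label distribution; a generic NumPy sketch, where probs is assumed to be an (N, 10) array of softmax outputs from a digit classifier pretrained on SC09:

    import numpy as np

    def inception_score(probs, eps=1e-12):
        # probs: (N, num_classes) array of p(y|x) for N generated samples
        p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution p(y)
        kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
        return float(np.exp(kl.mean()))  # exp(E_x[KL(p(y|x) || p(y))])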
Performance:
WaveGAN produced diverse and intelligible spoken digits on SC09, with promising qualitative results in other domains such as drums and bird vocalizations.
Contributions:
Introduces WaveGAN for unsupervised synthesis of raw-waveform audio, demonstrates its effectiveness across several audio domains, and highlights remaining challenges and potential solutions.