StarGAN-VC

Authors:
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

 

Description:
A method that enables non-parallel many-to-many voice conversion (VC) using a star generative adversarial network (StarGAN).

 

Training and Data:
Trains a StarGAN with four losses: an adversarial loss, a cycle-consistency loss, a domain classification loss, and an identity-mapping loss. The generator takes an acoustic feature sequence and a target attribute (speaker) label as inputs.
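
A minimal sketch of how these four losses might combine for the generator update, assuming G (generator), D (speaker-conditioned discriminator), and C (domain classifier) are PyTorch modules; the non-saturating GAN form and the loss weights below are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, C, x_src, c_src, c_tgt,
                   lambda_cls=1.0, lambda_cyc=1.0, lambda_id=1.0):
    """Sketch of a StarGAN-VC-style generator objective.

    x_src: acoustic feature sequence, e.g. (batch, 1, n_mcc, n_frames)
    c_src, c_tgt: source/target speaker labels, shape (batch,)
    Loss weights are placeholders, not the paper's values.
    """
    x_fake = G(x_src, c_tgt)  # convert toward the target speaker

    # Adversarial loss: converted features should fool the
    # discriminator for the target domain.
    d_out = D(x_fake, c_tgt)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))

    # Domain classification loss: converted features should be
    # classified as the target speaker.
    cls = F.cross_entropy(C(x_fake), c_tgt)

    # Cycle-consistency loss: converting back to the source speaker
    # should recover the input.
    cyc = torch.mean(torch.abs(G(x_fake, c_src) - x_src))

    # Identity-mapping loss: converting to the source label should
    # leave the input (approximately) unchanged.
    idt = torch.mean(torch.abs(G(x_src, c_src) - x_src))

    return adv + lambda_cls * cls + lambda_cyc * cyc + lambda_id * idt
```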

 

Advantages:
No need for parallel data, transcriptions, or time-alignment procedures. Learns many-to-many mappings simultaneously with a single generator, is fast enough for real-time implementation, and requires only a few minutes of training data.

 

Limitations:
TBD

 

Model Architecture:
The generator consists of encoder and decoder networks built from gated CNNs. The discriminator and the domain classifier also use gated CNNs; each takes an acoustic feature sequence as input and outputs a probability distribution (real/fake and speaker class, respectively).
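
A gated CNN layer can be sketched as below; the kernel size, channel counts, and the instance-normalization choice are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """A 2-D gated CNN layer: the convolution output is split into a
    linear part A and a gate B, combined as A * sigmoid(B) (a gated
    linear unit)."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # Produce 2 * out_ch channels: half for values, half for gates.
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, stride, padding)
        self.norm = nn.InstanceNorm2d(2 * out_ch, affine=True)

    def forward(self, x):
        h = self.norm(self.conv(x))
        a, b = h.chunk(2, dim=1)      # split channels into value / gate
        return a * torch.sigmoid(b)   # GLU activation

# Example: one encoder layer over (batch, 1, n_mcc, n_frames) features.
x = torch.randn(4, 1, 36, 128)
layer = GatedConv2d(1, 32, kernel_size=(3, 9), stride=(1, 1), padding=(1, 4))
print(layer(x).shape)  # torch.Size([4, 32, 36, 128])
```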

 

Dependencies:
The method relies on the StarGAN architecture, gated CNNs, and the WORLD vocoder for speech analysis and synthesis. It is implemented using neural network libraries such as PyTorch and optimized with the Adam optimizer.
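
A minimal sketch of the optimizer setup implied by these dependencies; the learning rates and beta values are hypothetical placeholders, not the paper's settings.

```python
import torch

def make_optimizers(G, D, C):
    """Separate Adam optimizers for the generator, discriminator,
    and domain classifier; hyperparameters are illustrative."""
    g_opt = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
    c_opt = torch.optim.Adam(C.parameters(), lr=1e-4, betas=(0.5, 0.999))
    return g_opt, d_opt, c_opt
```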

 

Synthesis:
Synthesizes speech with the WORLD vocoder from three feature streams: Mel-cepstral coefficients (MCCs), logarithmic fundamental frequency (log F0), and aperiodicities (APs). MCCs are converted with the trained StarGAN-VC generator, log F0 is converted using a logarithm Gaussian normalized transformation, and APs are used directly without modification.
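
The logarithm Gaussian normalized transformation is a linear shift and scale in the log-F0 domain; a sketch, assuming per-speaker log-F0 means and standard deviations computed over voiced frames of the training data.

```python
import numpy as np

def convert_f0(f0, mean_src, std_src, mean_tgt, std_tgt):
    """Logarithm Gaussian normalized transformation of F0.

    f0: array of F0 values in Hz (0 marks unvoiced frames).
    mean/std: per-speaker statistics of log F0 over voiced frames.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    # Normalize source log F0, then rescale to the target distribution.
    log_f0 = np.log(f0[voiced])
    converted[voiced] = np.exp(
        (log_f0 - mean_src) / std_src * std_tgt + mean_tgt
    )
    return converted
```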

 

Dataset:
Voice Conversion Challenge (VCC) 2018 dataset with recordings of US English speakers, manually segmented into training and evaluation sets.

 

Preprocessing:
Preprocesses the speech data by downsampling to 22050 Hz, extracting 36 Mel-cepstral coefficients (MCCs), logarithmic fundamental frequency (log F0), and aperiodicities (APs) every 5 ms using the WORLD analysis system. Normalizes the source and target MCCs per dimension before training.
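
A sketch of this analysis step, assuming the commonly used pyworld and pysptk Python wrappers (the paper does not specify its tooling); the all-pass constant alpha below is a typical value for 22050 Hz audio.

```python
import numpy as np
import pyworld
import pysptk

def extract_features(wav, fs=22050, n_mcc=36, frame_period=5.0):
    """WORLD analysis: F0, 36 MCCs, and aperiodicities every 5 ms."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs, frame_period=frame_period)  # F0 contour
    sp = pyworld.cheaptrick(wav, f0, t, fs)    # smoothed spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs)           # aperiodicity
    # Spectral envelope -> Mel-cepstral coefficients (order = n_mcc - 1).
    mcc = pysptk.sp2mc(sp, order=n_mcc - 1, alpha=0.455)
    return f0, mcc, ap

def normalize(mcc):
    """Per-dimension z-score normalization of MCCs before training."""
    mean = mcc.mean(axis=0, keepdims=True)
    std = mcc.std(axis=0, keepdims=True)
    return (mcc - mean) / std, mean, std
```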

 

Evaluation Metrics:
Mean Opinion Score (MOS) for naturalness, AB test for sound quality, and ABX test for speaker similarity.

 

Performance:
StarGAN-VC was evaluated through subjective listening tests and significantly outperformed the VAE-GAN baseline in both sound quality and speaker similarity.

 

Contributions:
Introduced StarGAN-VC for non-parallel many-to-many voice conversion, demonstrating its effectiveness in achieving high-quality converted speech without requiring parallel data, and providing a general framework applicable to various VC tasks.
Link to paper: https://arxiv.org/abs/1806.02169


Last Accessed: 7/10/2024

NSF Award #2346473