CycleGAN-VC

Authors:
Takuhiro Kaneko, Hirokazu Kameoka

 

Description:
A voice conversion method that does not rely on parallel data, using Cycle-Consistent Adversarial Networks (CycleGANs) with gated convolutional neural networks (CNNs) and an identity-mapping loss.

 

Training and Data:
Uses CycleGAN with adversarial loss and cycle-consistency loss for training, incorporating gated CNNs for capturing sequential and hierarchical structures, and an identity-mapping loss to preserve linguistic information.
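The cycle-consistency and identity-mapping terms described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `G_xy` and `G_yx` are hypothetical stand-ins for the two gated-CNN generators, and the adversarial term (which requires the discriminators) is omitted.

```python
import numpy as np

def cyclegan_vc_losses(x, y, G_xy, G_yx, lambda_cyc=10.0, lambda_id=5.0):
    """Cycle-consistency and identity terms of a CycleGAN-style objective.

    x, y: source/target feature sequences (e.g. MCEPs), shape (T, D).
    G_xy, G_yx: forward and inverse mappings (hypothetical stand-ins for
    the gated-CNN generators). The adversarial term is omitted here.
    The loss weights are illustrative, not the paper's settings.
    """
    # Cycle-consistency (L1): x -> G_xy -> G_yx should reconstruct x, and vice versa
    l_cyc = np.mean(np.abs(G_yx(G_xy(x)) - x)) + np.mean(np.abs(G_xy(G_yx(y)) - y))
    # Identity-mapping (L1): a generator fed an input already in its target
    # domain should leave it unchanged, preserving linguistic content
    l_id = np.mean(np.abs(G_xy(y) - y)) + np.mean(np.abs(G_yx(x) - x))
    return lambda_cyc * l_cyc, lambda_id * l_id

# Sanity check with identity generators: both terms vanish
x = np.random.randn(128, 24)
y = np.random.randn(128, 24)
l_cyc, l_id = cyclegan_vc_losses(x, y, lambda a: a, lambda a: a)
print(l_cyc, l_id)  # 0.0 0.0
```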

 

Advantages:
No need for parallel data, extra modules, or alignment procedures. Reduces over-smoothing and captures spectrotemporal structures effectively.

 

Limitations:
While the method showed competitive performance, there is still a quality gap between the original and converted speech.

 

Model Architecture:
The generator uses a 1D CNN for capturing overall feature relationships while preserving temporal structure, incorporating gated linear units (GLUs) as activation functions. The discriminator uses a 2D CNN to focus on 2D spectral textures.
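The gated linear unit at the heart of the gated CNN can be sketched in NumPy as below; the surrounding convolution layers are omitted, and the shapes are illustrative only. Half of the channels act as a data-driven gate on the other half.

```python
import numpy as np

def glu(h):
    """Gated linear unit: split the channel dimension in half and gate
    one half with the sigmoid of the other. This is the activation used
    in gated CNNs to select which features pass through."""
    a, b = np.split(h, 2, axis=-1)         # linear path and gate path
    return a * (1.0 / (1.0 + np.exp(-b)))  # elementwise sigmoid gating

# A (T, 2C) convolution output becomes a (T, C) gated feature map
h = np.random.randn(128, 64)
out = glu(h)
print(out.shape)  # (128, 32)
```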

 

Dependencies:
The method relies on the CycleGAN architecture, gated CNNs, and the WORLD vocoder for speech analysis and synthesis. It can be implemented with neural-network libraries such as PyTorch and trained with the Adam optimizer.

 

Synthesis:
Extracts Mel-cepstral coefficients (MCEPs), logarithmic fundamental frequency (log F0), and aperiodicities (APs) with the WORLD analysis system, then resynthesizes speech from the converted features. MCEPs are converted with CycleGAN-VC, log F0 is converted using a logarithm Gaussian normalized transformation, and APs are used without modification.
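The log F0 conversion amounts to matching the source speaker's log F0 statistics to the target speaker's. A small NumPy sketch follows; the speaker statistics shown are illustrative assumptions, not values from the paper.

```python
import numpy as np

def convert_log_f0(log_f0_src, mean_src, std_src, mean_tgt, std_tgt):
    """Logarithm Gaussian normalized transformation: z-score the source
    log F0, then rescale it to the target speaker's mean and std."""
    return (log_f0_src - mean_src) / std_src * std_tgt + mean_tgt

# Example with assumed speaker statistics (illustrative values only)
log_f0 = np.log(np.array([120.0, 150.0, 180.0]))  # a lower-pitched contour
out = convert_log_f0(log_f0, mean_src=np.log(130.0), std_src=0.2,
                     mean_tgt=np.log(220.0), std_tgt=0.25)
f0_converted = np.exp(out)  # back to Hz, now in the target speaker's range
```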

 

Dataset:
Voice Conversion Challenge 2016 (VCC 2016) dataset with manual segmentation into 216 short parallel sentences, divided into training and evaluation sets.

 

Preprocessing:
Preprocesses the speech data by downsampling to 16 kHz, extracting 24 Mel-cepstral coefficients (MCEPs), logarithmic fundamental frequency (log F0), and aperiodicities (APs) every 5 ms using the WORLD analysis system. Normalizes the source and target MCEPs per dimension before training.
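The per-dimension normalization of the MCEPs can be sketched as a standard z-score over the time axis. A minimal NumPy version, with random data standing in for real MCEP sequences:

```python
import numpy as np

def normalize_mceps(mceps):
    """Per-dimension z-score normalization of an MCEP sequence
    (T frames x 24 coefficients), applied before training. Returns the
    statistics so the conversion can be undone at synthesis time."""
    mean = mceps.mean(axis=0, keepdims=True)
    std = mceps.std(axis=0, keepdims=True)
    return (mceps - mean) / std, mean, std

mceps = np.random.randn(200, 24) * 3.0 + 1.5  # stand-in for real features
norm, mean, std = normalize_mceps(mceps)
# each dimension now has approximately zero mean and unit variance
```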

 

Evaluation Metrics:

Mean Opinion Score (MOS) for naturalness, subjective listening tests for speaker similarity, and the objective measures Global Variance (GV) and Modulation Spectra (MS).

 

Performance:
The CycleGAN-VC model is evaluated for naturalness and speaker similarity through subjective listening tests. Objective metrics such as Global Variance (GV) and Modulation Spectra (MS) show that the converted feature sequences are close to the target. The method outperforms the GMM-based baseline in subjective evaluations for naturalness.

 

Contributions:

Introduced a parallel-data-free VC method using CycleGAN, demonstrated its effectiveness in capturing spectrotemporal structures and reducing over-smoothing, and provided a general framework applicable to various VC tasks.

Link to tool


Last Accessed: 7/10/2024

NSF Award #2346473