Kei Akuzawa, Kotaro Onishi, Keisuke Takiguchi, Kohki Mametani, Koichiro Mori


The paper proposes a novel voice conversion (VC) method that uses a Conditional Deep Hierarchical Variational Autoencoder (CDHVAE) to improve the naturalness and similarity of converted speech without requiring parallel corpora or text transcriptions.


Training and Data:
Trained on the VCTK corpus using utterances from 20 speakers (10 female, 10 male). Acoustic features: mel-spectrograms. Vocoder: MelGAN. Batch size: 8. Epochs: 200. β values tested: 1, 10, 50.


The method requires no parallel corpora or text transcriptions, converts speech quickly, and has high model expressiveness, which leads to improved naturalness and similarity in converted speech.


Performance could be improved further by integrating auxiliary losses, and the model may need hyperparameter tuning to perform well in different scenarios.


Model Architecture:
The model is a deep hierarchical VAE (DHVAE) with hierarchical latent variables. It is adapted from the NVAE architecture by adding Conditional Instance Normalization (CIN), which conditions the network on speaker labels.
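
To make the conditioning mechanism concrete, here is a minimal NumPy sketch of Conditional Instance Normalization. It is an illustration under stated assumptions, not the authors' implementation: the function name, the weight matrices `w_gamma` and `w_beta`, and the channel count are hypothetical. Each channel is normalized over its spatial positions, and the scale and shift are then obtained from the one-hot speaker label by a linear map.

```python
import numpy as np

def conditional_instance_norm(x, speaker_onehot, w_gamma, w_beta, eps=1e-5):
    """Normalize each channel over (H, W), then apply a speaker-specific scale and shift.

    x:              feature map of shape (C, H, W)
    speaker_onehot: one-hot speaker label of shape (S,)
    w_gamma, w_beta: hypothetical linear maps of shape (S, C)
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)

    # Linear transformation of the speaker label gives per-channel parameters.
    gamma = speaker_onehot @ w_gamma  # shape (C,)
    beta = speaker_onehot @ w_beta    # shape (C,)
    return gamma[:, None, None] * x_hat + beta[:, None, None]

# Toy usage: 20 speakers as in the summary; 4 channels is an arbitrary choice.
rng = np.random.default_rng(0)
num_speakers, channels = 20, 4
x = rng.normal(size=(channels, 80, 40))
w_gamma = rng.normal(loc=1.0, scale=0.1, size=(num_speakers, channels))
w_beta = rng.normal(scale=0.1, size=(num_speakers, channels))
speaker = np.eye(num_speakers)[3]
y = conditional_instance_norm(x, speaker, w_gamma, w_beta)
```

After normalization, the per-channel statistics of `y` are determined by the selected speaker's row of `w_beta` and `w_gamma` rather than by the input. This is why swapping the speaker label changes speaker-dependent statistics while leaving the normalized content untouched.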


The method uses a β-VAE objective, CIN, and a MelGAN vocoder trained on the same corpus to convert mel-spectrograms to audio.
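
The role of β can be seen in a small sketch of a β-VAE loss for a hierarchy of Gaussian latent variables. This assumes diagonal Gaussian posteriors and priors at every level and a squared-error reconstruction term; the summary does not state the likelihood the paper uses, and all function names here are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over all dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def beta_vae_loss(x, x_recon, posteriors, priors, beta):
    """Reconstruction error plus a beta-weighted KL term summed over latent levels.

    posteriors, priors: lists of (mu, logvar) pairs, one pair per level.
    Squared error stands in for the reconstruction term (an assumption).
    """
    recon = np.sum((x - x_recon) ** 2)
    kl = sum(
        gaussian_kl(mu_q, lv_q, mu_p, lv_p)
        for (mu_q, lv_q), (mu_p, lv_p) in zip(posteriors, priors)
    )
    return recon + beta * kl, recon, kl

# Toy two-level hierarchy with made-up values.
x = np.full((2, 3), 0.5)
x_recon = np.zeros((2, 3))
posteriors = [(np.ones(4), np.zeros(4)), (np.zeros(2), np.zeros(2))]
priors = [(np.zeros(4), np.zeros(4)), (np.zeros(2), np.zeros(2))]

# The three beta values listed in the summary.
losses = {beta: beta_vae_loss(x, x_recon, posteriors, priors, beta)[0]
          for beta in (1, 10, 50)}
```

A larger β penalizes the KL term more heavily, pushing the posterior toward the prior at the cost of reconstruction accuracy.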


Voice Conversion:
Voice conversion is performed by encoding the source speech into speaker-invariant latent variables and decoding them with the target speaker's characteristics using a non-autoregressive decoder.
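
The encode-then-decode flow can be illustrated with a deliberately simple toy. This is not the CDHVAE: the "speaker characteristics" are just hypothetical per-speaker offsets, and `encode`, `decode`, and `convert` are made-up names. It only shows the control flow, which is to encode with the source speaker's label and decode with the target speaker's label.

```python
import numpy as np

# Hypothetical per-speaker offsets standing in for learned speaker characteristics.
speaker_offset = {
    "source": np.array([1.0, -2.0]),
    "target": np.array([-0.5, 3.0]),
}

def encode(x, speaker):
    """Toy encoder: strip the speaker-dependent offset to get a speaker-invariant latent."""
    return x - speaker_offset[speaker]

def decode(z, speaker):
    """Toy decoder: re-apply a speaker's offset to all frames at once (no frame-by-frame loop)."""
    return z + speaker_offset[speaker]

def convert(x, source, target):
    """Encode with the source label, then decode with the target label."""
    return decode(encode(x, source), target)

content = np.array([[0.1, 0.2], [0.3, 0.4]])  # (frames, features), made-up values
x_src = content + speaker_offset["source"]
x_conv = convert(x_src, "source", "target")
```

Decoding with the same label used for encoding reproduces the input, and decoding with a different label yields the same content with the other speaker's characteristics. In the real model the latent variables are only approximately speaker-invariant.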


The dataset is the VCTK corpus, using utterances from 20 speakers (10 female and 10 male). Mel-spectrograms extracted from 48 kHz audio serve as the acoustic features.


The input mel-spectrograms are resized to fit the model's input dimensions (C = 1, H = 80, W = T = 40), and speaker labels are embedded using a linear transformation for CIN.
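
One plausible way to obtain inputs of this shape is sketched below. The assumptions are: the spectrogram has 80 mel bins, it is cut into non-overlapping 40-frame chunks, and leftover frames are dropped. The summary says only that inputs are "resized", so the chunking strategy, the function names, and the embedding size are hypothetical.

```python
import numpy as np

def segment_mel(mel, width=40):
    """Split an (80, T) mel-spectrogram into chunks of shape (N, 1, 80, width).

    Frames beyond the last full chunk are dropped (an assumption).
    """
    n_mels, total_frames = mel.shape
    n_chunks = total_frames // width
    trimmed = mel[:, : n_chunks * width]
    # (n_mels, n_chunks, width) -> (n_chunks, n_mels, width)
    chunks = trimmed.reshape(n_mels, n_chunks, width).transpose(1, 0, 2)
    return chunks[:, None, :, :]  # add the channel axis, C = 1

def embed_speaker(speaker_id, weight):
    """Linear transformation of a one-hot speaker label; weight has shape (S, D)."""
    one_hot = np.zeros(weight.shape[0])
    one_hot[speaker_id] = 1.0
    return one_hot @ weight

# Toy usage: 100 frames give two 40-frame chunks, and 20 frames are dropped.
mel = np.arange(80 * 100, dtype=float).reshape(80, 100)
segments = segment_mel(mel)
weight = np.random.default_rng(0).normal(size=(20, 16))  # 20 speakers, D = 16 is arbitrary
embedding = embed_speaker(5, weight)
```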


Evaluation Metrics:
Evaluation uses Mean Opinion Scores (MOS) for naturalness and similarity, rated by human subjects on Amazon Mechanical Turk, together with measurements of KL divergence and reconstruction error.
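
For reference, a MOS is simply the arithmetic mean of listener ratings. The sketch below assumes the common 1-to-5 rating scale and adds a normal-approximation 95% confidence interval. The interval, the function name, and the ratings are illustrative only and are not taken from the paper.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * sem

ratings = [4, 3, 5, 4, 4, 3, 4, 5]  # made-up listener ratings
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```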


The proposed CDHVAE achieved MOS above 3.5 for both naturalness and similarity in inter-gender settings, outperforming existing autoencoder-based VC methods.


The paper introduces CDHVAE, a highly expressive VAE model for voice conversion, demonstrates the importance of model expressiveness and the β-VAE objective, and provides a thorough evaluation of the model's performance.
NSF Award #2346473