Authors:
Sercan Ö. Arık (sercanarik@baidu.com)
Jitong Chen (chenjitong01@baidu.com)
Kainan Peng (pengkainan@baidu.com)
Wei Ping (pingwei01@baidu.com)
Yanqi Zhou (yanqiz@baidu.com)
Institution:
Baidu Research, 1195 Bordeaux Dr., Sunnyvale, CA 94089
Abstract:
Voice cloning is a highly desired feature for personalized speech interfaces. This paper introduces a neural voice cloning system that learns to synthesize a person’s voice from only a few audio samples. Two approaches are studied: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a multi-speaker generative model on the cloning samples, while speaker encoding trains a separate model to infer a new speaker embedding directly from the cloning audio. Both approaches achieve good naturalness and similarity to the original speaker, even with very few cloning samples. Speaker adaptation offers slightly better naturalness and similarity, but speaker encoding is more favorable for low-resource deployment because it requires less cloning time and memory.
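As a rough illustration of the two approaches, the sketch below contrasts speaker adaptation (gradient fine-tuning of a pretrained model) with speaker encoding (a separate network that predicts the embedding with no fine-tuning at cloning time). All names and architecture details here (MultiSpeakerTTS, SpeakerEncoder, adapt, loss_fn, the embedding size) are illustrative assumptions, not the paper's actual implementation, which builds on Deep Voice 3.

```python
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    """Stand-in for a multi-speaker generative model.

    Speaker identity enters through a learned embedding; the text encoder,
    attention, and audio decoder are omitted for brevity.
    """
    def __init__(self, num_speakers: int, embedding_dim: int = 512):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, embedding_dim)

def adapt(model: MultiSpeakerTTS, cloning_batch, loss_fn, steps: int = 100):
    """Speaker adaptation: fine-tune on the few cloning samples.

    The paper considers adapting either the whole model or only the speaker
    embedding; this sketch fine-tunes just the embedding table. `loss_fn` is
    a placeholder for the TTS reconstruction loss.
    """
    opt = torch.optim.Adam(model.speaker_embedding.parameters(), lr=1e-4)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model, cloning_batch).backward()
        opt.step()

class SpeakerEncoder(nn.Module):
    """Speaker encoding: map cloning audio directly to a speaker embedding."""
    def __init__(self, n_mels: int = 80, embedding_dim: int = 512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, embedding_dim)

    def forward(self, mels):                 # mels: (num_clones, frames, n_mels)
        _, h = self.rnn(mels)                # final hidden state per sample
        e = self.proj(h[-1])                 # one embedding per cloning sample
        return e.mean(dim=0, keepdim=True)   # average across cloning samples
```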
Data Creation Method:
The multi-speaker generative model is trained on the LibriSpeech and VCTK multi-speaker datasets and learns to synthesize speech in the voice of a target speaker from only a few audio samples. The cloning audios for each experiment are randomly sampled from these datasets.
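For concreteness, drawing cloning samples for a held-out speaker might look like the sketch below. The directory layout and function name are assumptions for illustration (VCTK ships audio as wav48/&lt;speaker&gt;/&lt;utt&gt;.wav; LibriSpeech is organized differently), and the sample counts mirror the paper's sweeps over small numbers of cloning audios.

```python
import random
from pathlib import Path

def sample_cloning_audios(dataset_root: str, speaker_id: str,
                          num_samples: int, seed: int = 0) -> list[Path]:
    """Randomly draw `num_samples` cloning audios for one held-out speaker.

    Assumes a <root>/<speaker_id>/*.wav layout, a simplification of how
    VCTK/LibriSpeech are actually organized on disk.
    """
    files = sorted(Path(dataset_root, speaker_id).glob("*.wav"))
    rng = random.Random(seed)          # fixed seed -> reproducible experiments
    return rng.sample(files, num_samples)

# e.g. draw 5 cloning samples for one VCTK speaker (paths are illustrative)
clones = sample_cloning_audios("VCTK-Corpus/wav48", "p225", num_samples=5)
```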
Number of Speakers:
- LibriSpeech: 2484 speakers
- VCTK: 108 speakers
Total Size:
- 6 hours
Number of Real Samples:
- 10
Number of Fake Samples:
- 120
Description of the Dataset:
- The dataset consists of audio recordings and their corresponding transcriptions, used to train and evaluate the neural voice cloning system. Given a few audio samples from an unseen speaker, the system synthesizes speech that mimics that speaker's voice. The dataset includes both training data for the multi-speaker generative model and cloning samples for evaluation.
Extra Details:
The system supports two cloning techniques: speaker adaptation, which fine-tunes a pre-trained multi-speaker model, and speaker encoding, which estimates the speaker embedding directly from cloning audios. Models are evaluated on naturalness and speaker similarity using both subjective human evaluations and automated metrics such as speaker classification accuracy and equal error rate (EER).
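EER can be computed from same-speaker/different-speaker trial scores using the standard ROC-based definition, as in this sketch; the trial labels and scores below are made-up toy values for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept rate equals false-reject rate.

    `labels` are 1 for same-speaker trials, 0 for different-speaker trials;
    `scores` are similarity scores from a speaker verification model.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy trial list: e.g. cosine similarities between cloned and real embeddings
labels = [1, 1, 1, 0, 0, 0]
scores = [0.82, 0.74, 0.61, 0.40, 0.55, 0.33]
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```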
Data Type:
- Audio recordings and transcriptions
Average Length:
- 2
Keywords:
- Voice Cloning, Speech Synthesis, Speaker Adaptation, Speaker Encoding, Neural Networks
When Published:
- 20th February 2018
Annotation Process:
Annotations consist of the transcriptions paired with the audio used to train the multi-speaker generative model and with the cloning audios. They are required for training and for evaluating the naturalness and speaker similarity of the synthesized speech.
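A minimal sketch of assembling audio/transcription pairs, assuming a VCTK-style layout where wav48/p225/p225_001.wav is paired with txt/p225/p225_001.txt; LibriSpeech groups its transcripts differently, so this helper is illustrative only.

```python
from pathlib import Path

def load_pairs(wav_dir: str, txt_dir: str):
    """Pair each audio file with its transcription (VCTK-style layout assumed)."""
    pairs = []
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        txt = Path(txt_dir, wav.parent.name, wav.stem + ".txt")
        if txt.exists():                      # skip utterances without a transcript
            pairs.append((wav, txt.read_text().strip()))
    return pairs
```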
Usage Scenarios:
Voice cloning for personalized speech interfaces, creating expressive audiobooks, generating voices for virtual assistants, and research in speech synthesis and speaker adaptation.
Data Accessibility:
Publicly accessible
Miscellaneous Information:
The dataset supports advanced voice cloning techniques, namely speaker adaptation and speaker encoding: speaker adaptation fine-tunes a pre-trained model, while speaker encoding estimates speaker embeddings directly from audio samples. Models are evaluated using mean opinion scores (MOS) and speaker verification metrics.
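A MOS value for one system is the mean of 1-5 listener ratings, typically reported with a 95% confidence interval; the helper below sketches that computation with a t-distribution interval and made-up ratings.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean opinion score with a t-distribution confidence interval.

    `ratings` are 1-5 listener scores for one system; papers commonly report
    MOS +/- 95% CI for both naturalness and speaker similarity.
    """
    r = np.asarray(ratings, dtype=float)
    half = stats.sem(r) * stats.t.ppf((1 + confidence) / 2, len(r) - 1)
    return r.mean(), half

ratings = [4, 3, 4, 5, 3, 4, 4, 2, 5, 4]   # toy crowdsourced 5-point scores
mean, half = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half:.2f}")
```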