VoCo

Authors:
Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, Adam Finkelstein

 

Description:
VoCo is a text-based audio editing tool that allows users to replace or insert words in audio narrations seamlessly. The system synthesizes new words by stitching together snippets of audio from elsewhere in the narration, making the edited audio sound natural and consistent.

 

Training and Data:
The training process involves aligning the target speaker’s speech samples to the transcript using a forced alignment algorithm. This method converts the transcript to phoneme sequences and establishes the mapping between phonemes and speech samples in time.
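
As a rough illustration of what the alignment step yields, the sketch below converts a transcript to phonemes and pairs each phoneme with a sample range. It is a minimal sketch under stated assumptions: the toy lexicon, the boundary values, and the names `TOY_LEXICON`, `AlignedPhoneme`, `to_phonemes`, and `attach_boundaries` are hypothetical and not taken from the paper. A real forced aligner estimates the boundaries from the audio itself.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon (ARPAbet-style symbols); a real system would use a
# full pronunciation dictionary.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class AlignedPhoneme:
    phoneme: str
    start: int  # first sample index (inclusive)
    end: int    # end sample index (exclusive)


def to_phonemes(transcript):
    """Convert a transcript into a flat phoneme sequence using the lexicon."""
    phonemes = []
    for word in transcript.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


def attach_boundaries(phonemes, boundaries):
    """Pair each phoneme with the sample range reported for it.

    `boundaries` must hold len(phonemes) + 1 increasing sample indices.
    """
    if len(boundaries) != len(phonemes) + 1:
        raise ValueError("need exactly one more boundary than phonemes")
    return [
        AlignedPhoneme(p, boundaries[i], boundaries[i + 1])
        for i, p in enumerate(phonemes)
    ]


phonemes = to_phonemes("hello world")
# Made-up boundaries standing in for the output of a forced aligner.
alignment = attach_boundaries(
    phonemes, [0, 800, 2000, 2900, 4400, 5200, 6500, 7100, 8000]
)
```

Each `AlignedPhoneme` then identifies the slice of the recording that can later be reused as a snippet for that phoneme.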

 

Advantages:
Simplifies audio editing by allowing text-based modifications; produces high-quality, natural-sounding edits; reduces the time and effort needed for audio editing, making it accessible to non-experts.

 

Limitations:
The energy function is not yet optimized to match human perception; the method does not yet extend to synthesizing longer phrases with natural prosody; the features used are not optimized for seamless audio stitching.

 

Model Architecture:
The model architecture includes a TTS synthesizer for generating the initial word and a voice conversion system that uses dynamic triphone preselection, exchangeable triphones, and range selection to convert the TTS-generated word into the target voice.

 

Dependencies:
Uses forced alignment for phoneme sequence mapping; utilizes a TTS synthesizer for initial word generation; employs voice conversion techniques such as dynamic triphone preselection, exchangeable triphones, and range selection.

 

Synthesis Process:
The synthesis process uses a TTS synthesizer to generate the new word in a generic voice, then converts it to match the target voice using voice conversion. Range selection and dynamic triphone preselection are used to select optimal audio snippets for smooth transitions.
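
The intuition behind selecting ranges rather than isolated units is that longer contiguous snippets mean fewer stitch points. The sketch below illustrates that idea with a greedy longest-match cover over phoneme labels. It is not the paper's algorithm: the actual range selection minimizes an energy function over audio features, whereas this toy version only matches labels, and the names `longest_run` and `select_ranges` are hypothetical.

```python
def longest_run(target, pos, corpus):
    """Return (corpus_start, length) of the longest corpus run matching target[pos:]."""
    best_start, best_len = -1, 0
    for start in range(len(corpus)):
        length = 0
        while (pos + length < len(target)
               and start + length < len(corpus)
               and corpus[start + length] == target[pos + length]):
            length += 1
        if length > best_len:
            best_start, best_len = start, length
    return best_start, best_len


def select_ranges(target, corpus):
    """Greedily cover `target` with contiguous corpus ranges (start, end), end exclusive."""
    ranges, pos = [], 0
    while pos < len(target):
        start, length = longest_run(target, pos, corpus)
        if length == 0:
            raise ValueError(f"no corpus unit matches {target[pos]!r}")
        ranges.append((start, start + length))
        pos += length
    return ranges


# Made-up phoneme sequences for illustration.
corpus = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
target = ["W", "ER", "L", "OW"]
ranges = select_ranges(target, corpus)
```

In this example the four target phonemes are covered by two corpus ranges, so only one stitch point is needed instead of three.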

 

Dataset:
The experiments use the CMU Arctic dataset, which includes recordings of multiple voices. The recordings are segmented by phoneme for both training and synthesis.

 

Preprocessing:
Aligning the target speaker’s speech samples to the transcript using forced alignment; extracting phoneme sequences and establishing mappings between phonemes and speech samples in time.

 

Evaluation Metrics:
Mean Opinion Score (MOS) tests to evaluate the quality of synthesized words; identification tests to measure how often synthesized words are indistinguishable from original recordings; a comparison against manual editing by audio experts to evaluate VoCo’s efficiency and effectiveness.
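
For readers unfamiliar with these metrics, the sketch below shows how a MOS and an identification rate could be computed from listener responses. The ratings and judgements are made-up placeholder values, not results from the paper, and the function names and the normal-approximation confidence interval are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for ratings on a 1-5 scale.

    The interval uses a normal approximation, which is an assumption here.
    """
    m = mean(ratings)
    half = 1.96 * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, half


def judged_real_rate(judgements):
    """Fraction of synthesized clips that listeners labeled as real recordings."""
    return sum(judgements) / len(judgements)


# Made-up listener responses for illustration only.
ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(ratings)
rate = judged_real_rate([True, False, True, True])
```

A higher `rate` means listeners more often mistook synthesized words for original recordings.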

 

Performance:
VoCo’s synthesized words were often rated as more natural than those produced by baseline methods; manual editing by experts took significantly longer and resulted in lower-quality edits than VoCo; listeners often could not distinguish VoCo’s edits from original recordings, especially after the edits were manually refined.

 

Contributions:
Introduces a novel method for text-based insertion and replacement in audio narration; improves synthesis quality with exchangeable triphones and dynamic triphone preselection; provides a user-friendly interface for text-based audio editing; demonstrates significant improvements over existing methods in terms of synthesis quality and editing efficiency.


Last Accessed: 7/18/2024

NSF Award #2346473