Speech Synthesis for Text-Based Editing of Audio Narration

Report ID: TR-012-18
Authors:
Date: May 21, 2018
Pages: 122
Download Formats: [PDF]

Abstract:

Recorded audio narration plays a crucial role in many contexts, including online lectures, documentaries, demo videos, podcasts, and radio. However, editing audio narration with conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to perform select, cut, copy, and paste operations in the text transcript of the narration and apply the changes to the waveform accordingly. However, such interfaces do not support synthesizing new words that do not appear in the transcript. While it is possible to build a high-fidelity speech synthesizer from samples of a new voice, such synthesizers typically require a large amount of voice data as input, as well as substantial manual annotation, to perform well.
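To make the interaction concrete, the sketch below shows how a transcript edit can map onto the waveform when a word-level alignment is available. The data structures and function names are hypothetical illustrations, not the API of any particular editor described in the thesis.

    # A minimal sketch of transcript-driven waveform editing, assuming a
    # word-level forced alignment is available. All names here are
    # hypothetical stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        start: float  # seconds into the waveform
        end: float

    def delete_word(samples, words, index, sr):
        """Remove one word by splicing out its aligned waveform span."""
        w = words[index]
        a, b = int(w.start * sr), int(w.end * sr)
        edited = samples[:a] + samples[b:]           # cut the audio span
        remaining = words[:index] + words[index+1:]  # drop the transcript entry
        return edited, remaining

Cut, copy, and paste all reduce to such slice operations on aligned spans; what this scheme cannot do is produce audio for a word with no aligned span, which motivates the synthesizer presented next.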
This thesis presents a speech synthesizer tailored for text-based editing of narrations. The basic idea is to synthesize the input word in a different voice using a standard pre-built speech synthesizer, and then transform that voice into the desired voice using voice conversion. Unfortunately, conventional voice conversion does not produce synthesis of sufficient quality for the stated application. Hence, this thesis introduces new voice conversion techniques that synthesize words with high individuality and clarity. Three methods are proposed. The first, called CUTE, is a data-driven voice conversion method based on frame-level unit selection and exemplar features. The second, called VoCo, builds on CUTE with several improvements that help the synthesized word blend more seamlessly into the context where it is inserted. Both CUTE and VoCo select sequences of audio frames from the voice samples and stitch them together to approximate the voice being converted.
The third method improves over VoCo with deep neural networks. It involves two networks: FFTNet generates high-quality waveforms from acoustic features, and TimbreNet transforms the acoustic features of the generic synthesizer voice into those of a human voice.
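The data flow of this two-network pipeline can be sketched as follows. The tiny frame-wise MLP and the generic vocoder argument are illustrative stand-ins under assumed mel-like features; the actual FFTNet and TimbreNet architectures are described in the thesis itself.

    # A minimal sketch of the two-stage neural pipeline's data flow.
    # The models are placeholders, not the thesis's architectures.

    import torch
    import torch.nn as nn

    class TimbreNetSketch(nn.Module):
        """Stage 1 stand-in: map per-frame acoustic features of the generic
        synthesizer voice into the target speaker's feature space."""
        def __init__(self, dim=80):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))

        def forward(self, feats):  # feats: (frames, dim)
            return self.net(feats)

    def synthesize_word(word_feats, timbre_net, vocoder):
        """word_feats: generic-voice features of the newly synthesized word.
        vocoder: any feature-to-waveform model standing in for FFTNet."""
        target_feats = timbre_net(word_feats)  # stage 1: convert timbre
        return vocoder(target_feats)           # stage 2: generate the waveform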
