Yunyun Wang will present her FPO "Generative Universal Models for Speech" on Friday, May 30, 2025, at 2:00 PM in CS 302 and via Zoom.
Zoom link: https://princeton.zoom.us/j/9195627075?pwd=a1N5a2VzYy92cEFMVjFTZUZnOGpiUT09
The members of Yunyun’s committee are as follows:
Examiners: Adam Finkelstein (Adviser), Szymon Rusinkiewicz, Felix Heide
Readers: Danqi Chen, Zeyu Jin (Adobe Research)
A copy of her thesis is available upon request; please email gradinfo@cs.princeton.edu if you would like one.
Everyone is invited to attend her talk.
Abstract follows:
This thesis presents a comprehensive framework for controllable speech synthesis through self-supervised generative modeling. We propose Generative Universal Models for Speech (GUMS), a system that decomposes speech into disentangled representations (speaker embeddings, acoustic embeddings, and content representations) and reconstructs it using a synthesis model. Our approach enables detailed control over speaker voice, environmental acoustics, speech content, and speaking rate.
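To picture the overall analysis/synthesis structure described above, here is a minimal sketch of that decompose-then-reconstruct interface. All names and computations are hypothetical placeholders for illustration only, not the actual GUMS models.

    import numpy as np

    def extract_speaker(wav: np.ndarray) -> np.ndarray:
        # Placeholder: one global vector per utterance, standing in for speaker identity.
        return np.mean(wav.reshape(-1, 160), axis=1)[:16]

    def extract_acoustics(wav: np.ndarray) -> np.ndarray:
        # Placeholder: coarse frame-level energy, standing in for room/channel cues.
        frames = wav.reshape(-1, 160)
        return np.log(np.mean(frames ** 2, axis=1) + 1e-8)

    def extract_content(wav: np.ndarray) -> np.ndarray:
        # Placeholder: frame-level features, standing in for linguistic content.
        return wav.reshape(-1, 160) @ np.random.default_rng(0).normal(size=(160, 32))

    def synthesize(speaker, acoustics, content) -> np.ndarray:
        # Placeholder decoder: in GUMS this role is played by a learned synthesis model.
        return np.zeros(content.shape[0] * 160)

    wav = np.random.default_rng(1).normal(size=16000)  # 1 s of dummy audio
    recon = synthesize(extract_speaker(wav), extract_acoustics(wav), extract_content(wav))

Editing any one factor before resynthesis (e.g., swapping the speaker vector) is what gives this style of pipeline its controllability.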
We introduce three key representation models. First, GR0 learns global speaker embeddings by disentangling them from time-varying local content, without requiring speaker labels. Second, we develop the content representation models AIC and GUMS Codec, which capture speech content in continuous and quantized forms, respectively. The AIC model enforces speaker and pitch invariance through an alteration-invariant content loss. GUMS Codec builds on the speech codec model DAC, incorporating residual vector quantization along with speaker and pitch conditioning. The result is a highly compact, discrete, and language-independent representation that is well-suited for manipulation, control, and efficient transmission.
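For readers unfamiliar with the residual vector quantization step mentioned above, the sketch below shows the generic algorithm: each quantizer stage encodes the residual left by the previous stage. This is a standalone illustration with random codebooks, not the GUMS Codec or DAC code, and it omits the speaker and pitch conditioning.

    import numpy as np

    def rvq_encode(x, codebooks):
        # Generic residual vector quantization: stage k quantizes the residual
        # left by stages 1..k-1; the chosen indices form the discrete code.
        residual = x.copy()
        indices, quantized = [], np.zeros_like(x)
        for cb in codebooks:                            # cb: (codebook_size, dim)
            dists = np.linalg.norm(residual[None, :] - cb, axis=1)
            idx = int(np.argmin(dists))
            indices.append(idx)
            quantized += cb[idx]
            residual -= cb[idx]
        return indices, quantized

    rng = np.random.default_rng(0)
    dim = 8
    codebooks = [rng.normal(size=(256, dim)) for _ in range(4)]  # 4 quantizer stages
    frame = rng.normal(size=dim)                                 # one content frame
    codes, approx = rvq_encode(frame, codebooks)
    print(codes, np.linalg.norm(frame - approx))

Stacking stages this way keeps each codebook small while the reconstruction error shrinks with every added stage, which is what makes the resulting tokens compact enough for transmission and manipulation.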
We then integrate these representations into DiTVC, a high-fidelity speech synthesis model based on a Diffusion Transformer architecture. DiTVC enables direct prompting with target-speaker audio rather than relying on fixed embeddings, allowing for more expressive voice conversion and robust prosody control. By combining these models, we achieve controllable, high-quality speech synthesis using unlabeled, in-the-wild data. The unified framework advances both representation learning and generation, offering an interpretable and editable approach to speech synthesis.
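The contrast between fixed-embedding conditioning and audio prompting can be pictured at the interface level, as in the sketch below. The function names and shapes are hypothetical and not DiTVC's actual API; the point is only that the prompt approach feeds encoded frames of the target speaker's audio into the conditioning sequence instead of a single precomputed vector.

    import numpy as np

    def convert_with_embedding(content_tokens, speaker_embedding):
        # Conventional approach: a fixed, precomputed speaker vector is tiled
        # across time and attached to every content frame.
        tiled = np.tile(speaker_embedding, (len(content_tokens), 1))
        return np.concatenate([tiled, content_tokens], axis=1)

    def convert_with_prompt(content_tokens, prompt_frames):
        # Prompt-based approach: encoded target-speaker frames are prepended to
        # the conditioning sequence, so the model can attend to them directly.
        return np.concatenate([prompt_frames, content_tokens], axis=0)

    content = np.zeros((50, 32))               # content representation frames
    embedding = np.zeros(16)                   # fixed speaker embedding
    prompt = np.ones((30, 32))                 # encoded target-speaker prompt frames
    fixed_cond = convert_with_embedding(content, embedding)   # shape (50, 48)
    prompt_cond = convert_with_prompt(content, prompt)        # shape (80, 32)

Letting the model attend to raw prompt frames, rather than a single averaged vector, is what allows richer speaker and prosody cues to carry through to the output.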