Probabilistic Graphical Models for the Analysis and Synthesis of Musical Audio (thesis) | Computer Science Department at Princeton University

Report ID:

TR-886-10

Authors:

Hoffman, Matthew

Date:

September 2010

Pages:

124

Abstract:

Content-based Music Information Retrieval (MIR) systems seek to automatically extract
meaningful information from musical audio signals. This thesis applies new and existing
generative probabilistic models to several content-based MIR tasks: timbral similarity
estimation, semantic annotation and retrieval, and latent source discovery and separation.

In order to estimate how similar two songs sound to one another, we employ a Hierarchical
Dirichlet Process (HDP) mixture model to discover a shared representation of the
distribution of timbres in each song. Comparing songs under this shared representation
yields better query-by-example retrieval quality and scalability than previous approaches.
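
As a rough illustration of comparing songs under a shared representation, the sketch below treats each song as a probability vector over the same set of mixture components and ranks a database by divergence from a query song. The per-song weights, the smoothing constant, and the use of symmetrized KL divergence are assumptions made for illustration; the abstract does not say which comparison the thesis uses, and fitting the HDP itself is not shown.

```python
import math

def normalize(weights, eps=1e-9):
    """Smooth and normalize nonnegative weights into a probability vector."""
    smoothed = [w + eps for w in weights]
    total = sum(smoothed)
    return [w / total for w in smoothed]

def symmetric_kl(p, q):
    """Symmetrized KL divergence, KL(p||q) + KL(q||p), on a shared support."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def rank_by_similarity(query, database):
    """Return song names sorted from most to least similar to the query."""
    q = normalize(query)
    scores = {name: symmetric_kl(q, normalize(w)) for name, w in database.items()}
    return sorted(scores, key=scores.get)

# Hypothetical per-song weights over four shared timbre components.
database = {
    "song_a": [0.70, 0.20, 0.05, 0.05],
    "song_b": [0.10, 0.10, 0.40, 0.40],
    "song_c": [0.60, 0.30, 0.05, 0.05],
}
ranking = rank_by_similarity([0.65, 0.25, 0.05, 0.05], database)
```

Because every song is described over the same components, each comparison costs time proportional to the number of components rather than to the length of the audio.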

To predict what tags are likely to apply to a song (e.g., “rap,” “happy,” or “driving
music”), we develop the Codeword Bernoulli Average (CBA) model, a simple and fast
mixture-of-experts model. Despite its simplicity, CBA performs at least as well as
state-of-the-art approaches at automatically annotating songs and at finding the songs in
a database to which a given tag most applies.
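
One way to read the name "Codeword Bernoulli Average" is sketched below: each song is summarized by a normalized histogram over vector-quantization codewords, each codeword carries a Bernoulli parameter per tag, and the probability that a tag applies is the histogram-weighted average of those parameters. This reading, the EM update rules, the function names, and the toy data are all assumptions for illustration and are not taken from the thesis text.

```python
def predict_tag_prob(theta, beta):
    """Histogram-weighted average of per-codeword Bernoulli parameters."""
    return sum(t * b for t, b in zip(theta, beta))

def fit_tag(thetas, labels, n_iter=50, eps=1e-6):
    """EM for one tag's per-codeword parameters (assumed update rules).

    thetas: per-song codeword histograms, each summing to 1.
    labels: 1 if the tag applies to the song, else 0.
    """
    K = len(thetas[0])
    beta = [0.5] * K
    for _ in range(n_iter):
        num = [0.0] * K
        den = [0.0] * K
        for theta, y in zip(thetas, labels):
            # E-step: how responsible each codeword is for this song's label.
            lik = [t * (b if y else 1.0 - b) for t, b in zip(theta, beta)]
            total = sum(lik)
            for k in range(K):
                h = lik[k] / total
                num[k] += h * y
                den[k] += h
        # M-step: responsibility-weighted fraction of positive labels,
        # clamped away from 0 and 1 to keep the E-step well defined.
        beta = [min(1.0 - eps, max(eps, n / d)) if d > 0 else b
                for n, d, b in zip(num, den, beta)]
    return beta

# Toy data: codeword 0 dominates tagged songs, codeword 1 untagged ones.
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]
beta = fit_tag(thetas, labels)
p_like_tagged = predict_tag_prob([0.85, 0.15], beta)
p_like_untagged = predict_tag_prob([0.15, 0.85], beta)
```

Under this reading, annotation ranks tags by predicted probability for one song, and retrieval ranks songs by predicted probability for one tag.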

Finally, we address the problem of latent source discovery and separation by developing
two Bayesian nonparametric models, the Shift-Invariant HDP (SIHDP) and Gamma Process NMF.
These models allow us to discover what sounds (e.g., bass drums or guitar chords) are
present in a song or set of songs and to isolate or suppress individual sources. These models’
ability to decide how many latent sources are necessary to model the data is particularly
valuable in this application, since it is impossible to guess a priori how many sounds will
appear in a given song or set of songs.
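
To make the source-discovery idea concrete, here is a minimal sketch of ordinary nonnegative matrix factorization (NMF) applied to a toy magnitude spectrogram. It is plain NMF with a fixed number of components K and Lee-Seung multiplicative updates for squared error, not Gamma Process NMF: the caller must choose K, which is exactly the choice the nonparametric models are meant to avoid. The toy matrix V and all function names are hypothetical.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius reconstruction error ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, K, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative V (F x T) into W (F x K) and H (K x T)."""
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(K)] for _ in range(F)]
    H = [[rng.random() + 0.1 for _ in range(T)] for _ in range(K)]
    for _ in range(n_iter):
        # Update activations: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # Update spectral templates: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

def isolate(W, H, k):
    """Reconstruct only component k's contribution (one 'source')."""
    return [[row[k] * h for h in H[k]] for row in W]

# Toy spectrogram (4 frequency bins x 5 frames) mixing two sources.
V = [
    [1, 0, 1, 1, 0],
    [0, 2, 2, 0, 2],
    [1, 0, 1, 1, 0],
    [0, 2, 2, 0, 2],
]
W, H = nmf(V, K=2)
source_0 = isolate(W, H, 0)
```

Each column of W acts as a spectral template and each row of H records when that template is active. Suppressing a source amounts to leaving its component out when the per-component reconstructions are summed.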

Once they have been fit to data, probabilistic models can also be used to drive the
synthesis of new musical audio, both for creative purposes and to qualitatively diagnose
what information a model does and does not capture. We also adapt the SIHDP model
to resynthesize input audio from arbitrary sample sets, for example, to create a
sound file that matches a song as closely as possible by combining clips of spoken text.