Probabilistic Graphical Models for the Analysis and Synthesis of Musical Audio (thesis)

Report ID: TR-886-10
Author: Hoffman, Matthew
Date: 2010-10-00
Pages: 124
Download Formats: |PDF|
Abstract:

Content-based Music Information Retrieval (MIR) systems seek to automatically extract meaningful information from musical audio signals. This thesis applies new and existing generative probabilistic models to several content-based MIR tasks: timbral similarity estimation, semantic annotation and retrieval, and latent source discovery and separation.

In order to estimate how similar two songs sound to one another, we employ a Hierarchical Dirichlet Process (HDP) mixture model to discover a shared representation of the distribution of timbres in each song. Comparing songs under this shared representation yields better query-by-example retrieval quality and scalability than previous approaches.

To predict what tags are likely to apply to a song (e.g., “rap,” “happy,” or “driving music”), we develop the Codeword Bernoulli Average (CBA) model, a simple and fast mixture-of-experts model. Despite its simplicity, CBA performs at least as well as state-of- the-art approaches at automatically annotating songs and finding to what songs in a database a given tag most applies.

Finally, we address the problem of latent source discovery and separation by developing two Bayesian nonparametric models, the Shift-Invariant HDP and Gamma Process NMF. These models allow us to discover what sounds (e.g. bass drums, guitar chords, etc.) are present in a song or set of songs and to isolate or suppress individual source. These models’ ability to decide how many latent sources are necessary to model the data is particularly valuable in this application, since it is impossible to guess a priori how many sounds will appear in a given song or set of songs.

Once they have been fit to data, probabilistic models can also be used to drive the synthesis of new musical audio, both for creative purposes and to qualitatively diagnose what information a model does and does not capture. We also adapt the SIHDP model to create new versions of input audio with arbitrary sample sets, for example, to create a sound file that matches a song as closely as possible by combining spoken text.