Topic modeling

Much of my research is in topic models, a suite of algorithms that uncover the hidden thematic structure in a collection of documents. These algorithms help us develop new ways to search, browse, and summarize large archives of text.

Below, you will find links to introductory materials, corpus browsers based on topic models, and open-source software (from my research group) for topic modeling.

Introductory materials

The topic models mailing list is a good forum for discussing topic modeling.

Corpus browsers based on topic models

The structure uncovered by topic models can be used to explore an otherwise unorganized collection, for example by dividing documents according to their topics and by using the hidden structure to determine similarity between documents.
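As a rough illustration of both uses, the sketch below assumes a topic model has already been fit and that each document comes with a vector of topic proportions. The document names and proportions are made up for the example. Hellinger distance is one common way to compare such distributions, and it is an assumption here, not necessarily what any of the browsers on this page use.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def dominant_topic(theta):
    """Index of the topic with the largest proportion in a document."""
    return max(range(len(theta)), key=theta.__getitem__)


def most_similar(doc_topics, query, k=2):
    """Return the k documents closest to `query`, nearest first."""
    distances = [
        (name, hellinger(doc_topics[query], theta))
        for name, theta in doc_topics.items()
        if name != query
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Hypothetical topic proportions for four documents and three topics.
doc_topics = {
    "genetics_paper": [0.80, 0.15, 0.05],
    "evolution_paper": [0.70, 0.10, 0.20],
    "neuroscience_paper": [0.05, 0.85, 0.10],
    "statistics_paper": [0.10, 0.10, 0.80],
}

# Divide documents according to their dominant topic.
groups = {}
for name, theta in doc_topics.items():
    groups.setdefault(dominant_topic(theta), []).append(name)

# Rank the other documents by similarity to one query document.
neighbors = most_similar(doc_topics, "genetics_paper", k=2)
```

Here `groups` maps each topic index to the documents it dominates, and `neighbors` lists the documents whose topic proportions are closest to those of the query.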

To build your own browsers, see Allison Chaney's excellent Topic Model Visualization Engine (TMVE). For example, here is a browser of 100,000 Wikipedia articles that uses TMVE.

The following are some other browsers of large collections of documents built with topic models. (These were not built with TMVE.)

Also see Sean Gerrish's discipline browser for an interesting application of topic modeling at JSTOR.

Topic modeling software

Our research group has released many open-source software packages for topic modeling. Please post questions, comments, and suggestions about this code to the topic models mailing list.

Link | Model/Algorithm | Language | Author | Notes
lda-c | Latent Dirichlet allocation | C | David Blei | This implements variational inference for LDA.
online lda | Online inference for LDA | Python | Matt Hoffman | Fits topic models to massive data. The demo downloads random Wikipedia articles and fits a topic model to them.
tmve | Topic Model Visualization Engine | Python | Allison Chaney | A package for creating corpus browsers. See, for example, Wikipedia.
hdp | Hierarchical Dirichlet processes | C++ | Chong Wang | These are topic models where the data determine the number of topics. This implements Gibbs sampling in HDPs for text.
dtm | Dynamic topic models and the influence model | C++ | Sean Gerrish and David Blei | This implements topics that change over time and a model of how individual documents predict that change.
ctm-c | Correlated topic models | C | David Blei | This implements variational inference for the CTM.
hlda | Hierarchical latent Dirichlet allocation | C | David Blei | This implements a topic model that finds a hierarchy of topics. The structure of the hierarchy is determined by the data.
class-slda | Supervised topic models for classification | C++ | Chong Wang | Implements supervised topic models with a categorical response.
lda | R package for Gibbs sampling in many models | R | Jonathan Chang | Implements many models and is fast. Supports LDA, RTMs (for networked documents), MMSB (for network data), and sLDA (with a continuous response).
turbotopics | Turbo topics | Python | David Blei | Turbo topics find significant multiword phrases in topics.
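
For readers who want a feel for what fitting a topic model involves before downloading a package, here is a toy sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the code from any of the packages listed above. The function name `lda_gibbs`, its parameters, and the tiny corpus are hypothetical and chosen only for illustration, and the sketch ignores the engineering that makes the real packages fast.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi): per-document topic proportions and
    per-topic word distributions, estimated from the final sample.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # total words assigned to topic k

    # Start from a random topic assignment for every word.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic given all other assignments.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi


# Hypothetical corpus over the vocabulary ["gene", "dna", "brain", "neuron"].
docs = [[0, 1, 0, 1, 1], [0, 0, 1], [2, 3, 3, 2], [3, 2, 2, 3, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

Each row of `theta` gives one document's topic proportions and each row of `phi` gives one topic's distribution over the vocabulary. On real collections, the packages in the table are the better choice.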