Protein Quantification Across Hundreds of Experiments: Efficient Algorithms for LC-MS Data Analysis

Zia Khan
Computer Science, Princeton University

One of the driving aims of studies that quantitatively measure gene expression across hundreds of experimental conditions and replicates is the identification of the genes and pathways affected in disease. Measuring gene expression alone falls short in one major respect: it provides an incomplete read-out of cellular physiology. Affected pathways may involve changes in overall protein abundance or changes in the proportions of post-translationally modified variants of these proteins. Quantitative proteomics aims to address this problem. The primary measurement tool of quantitative proteomics is liquid chromatography-tandem mass spectrometry (LC-MS/MS).

Despite significant advances in LC-MS/MS instrumentation, quantitative proteomics studies have been limited to a small number of experimental conditions and replicates. This situation exists in part because extracting quantitative measurements from LC-MS/MS data sets is computationally more difficult than reading quantitative measurements from a gene expression microarray. In this talk, I present algorithmic techniques drawn from computational geometry that directly address this computational challenge. We validate these techniques using several data sets with both internal and external measures of quantification accuracy. We demonstrate the scalability of these techniques using a large data set that spans a total of 472 experimental conditions and replicates.