Using Structural Information in Machine Learning Applications (thesis)
Classification problems encountered in real-life applications often have domain-specific structural information available on the measured data, which cannot be readily accommodated by conventional machine learning algorithms. Ignoring the structure and blindly running a conventional algorithm on the numerical data can compromise the quality of solutions.
This thesis addresses two complementary settings: one in which there is a hierarchy among multiple class labels (output structure), and one in which the input features are known to be sequentially correlated (input structure). Probabilistic graphical models are used to encode the dependencies, and model parameters are estimated using efficient inference algorithms. Both scenarios are motivated by real bioinformatics problems, namely gene function prediction and aneuploidy-based cancer classification, but they also have applications in other domains, such as computer graphics, music, and text classification.
The first part focuses on structure among a group of output classes.
In many domains, large numbers of overlapping classes are organized into hierarchies. In multi-label classification over such a hierarchy, members of a class must also belong to all of its parents. Training an independent classifier for each class is a common approach, but the resulting labels for a given example may collectively violate this constraint. We propose a principled method of resolving such inconsistencies that increases accuracy over all classes. Our approach is to view the hierarchy as a graphical model and to employ Bayesian inference to infer the most likely set of hierarchically consistent class labels from the independent base classifiers' predictions. The method is applicable to any base classification algorithm. Experiments on synthetic data, as well as real data sets from the bioinformatics and computer graphics domains, illustrate its behavior under a range of conditions and demonstrate that it improves accuracy at all levels of a hierarchy.
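The consistency-restoring inference described above can be illustrated with a toy sketch. Assuming a tree-shaped hierarchy and independent base-classifier probabilities (the class names and probability values below are invented for illustration), a dynamic program over the tree recovers the most likely labeling in which every positive class's parent is also positive. The thesis formulates this as Bayesian inference on a graphical model; this MAP computation is a deliberately simplified instance of the same idea.

```python
from math import log

# Toy hierarchy (child -> parent; the root's parent is None).
# Class names and probabilities are invented for illustration.
parent = {
    "metabolism": None,
    "lipid_metabolism": "metabolism",
    "fatty_acid_oxidation": "lipid_metabolism",
    "sterol_synthesis": "lipid_metabolism",
}
# Independent base-classifier estimates of P(example belongs to class).
probs = {
    "metabolism": 0.6,
    "lipid_metabolism": 0.7,
    "fatty_acid_oxidation": 0.2,
    "sterol_synthesis": 0.8,
}

children = {n: [] for n in parent}
roots = []
for node, par in parent.items():
    if par is None:
        roots.append(node)
    else:
        children[par].append(node)

def score(node):
    """Return (log-score if node is negative, log-score if positive,
    positives of the best subtree labeling given this node is positive)."""
    s0 = log(1.0 - probs[node])   # a negative node forces all descendants negative
    s1 = log(probs[node])
    pos = {node}
    for c in children[node]:
        c0, c1, cpos = score(c)
        s0 += c0
        if c1 > c0:               # a child may be positive only if its parent is
            s1 += c1
            pos |= cpos
        else:
            s1 += c0
    return s0, s1, pos

labels = set()
for r in roots:
    s0, s1, pos = score(r)
    if s1 > s0:
        labels |= pos
print(sorted(labels))
```

Note how the low-confidence prediction for `fatty_acid_oxidation` is dropped while its confident sibling and the ancestors are kept, so the final label set never violates the hierarchy constraint.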
The second part focuses on structure among input features, in the form of a sequential relationship. Generic non-sequential machine learning models treat the input features as an unordered set, attaching no importance to their order.
Sequence models (e.g., hidden Markov models), on the other hand, must assume stationarity to keep the number of parameters manageable; they capture only sequence-wide regularities and lose the significance of particular positions. We propose a fixed-length sequence classification method that combines sequential correlations with positional features in a sparsely regularized solution, with training and inference algorithms that run in time linear in the sequence length. Motivated by the problem of classifying tumors by their genetic copy number changes, our method can identify copy number alteration regions in noisy array-CGH data and locate the clinically relevant genes that drive these alterations and affect the cancer label. Experiments on synthetic array-CGH data modeled from real human breast tumors, as well as real tumor datasets from breast cancer, bladder cancer, and uveal melanoma, demonstrate that our method matches or exceeds state-of-the-art methods in accuracy and produces biologically significant predictions for clinically relevant genes.
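As a rough illustration of coupling positional features with sequential correlation under sparse regularization, the sketch below trains a logistic classifier on synthetic copy-number-like profiles, using an L1 penalty for sparsity plus a fused-lasso-style penalty that ties adjacent positions together, so that probes in an altered region receive similar weights. The data, region boundaries, and hyperparameters are invented, and this subgradient scheme is not the thesis method (which uses a graphical model with linear-time inference); it only sketches the underlying intuition.

```python
import math
import random

random.seed(0)
L, N = 20, 60                    # probes per profile, examples per class
REGION = range(8, 13)            # hypothetical amplified region in class-1 tumors

def sample(label):
    """Synthetic log-ratio profile: Gaussian noise plus a gain in REGION for class 1."""
    x = [random.gauss(0.0, 0.3) for _ in range(L)]
    if label == 1:
        for i in REGION:
            x[i] += 1.0
    return x

X = [sample(0) for _ in range(N)] + [sample(1) for _ in range(N)]
y = [0] * N + [1] * N

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgn(v):
    return (v > 0) - (v < 0)

w, b = [0.0] * L, 0.0
lr, lam1, lam2 = 0.05, 0.01, 0.01   # step size, sparsity and fusion strengths

for _ in range(300):
    gw, gb = [0.0] * L, 0.0
    for xi, yi in zip(X, y):        # batch logistic-loss gradient
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        for j in range(L):
            gw[j] += err * xi[j]
        gb += err
    for j in range(L):
        g = gw[j] / len(X) + lam1 * sgn(w[j])   # L1: prefer few nonzero positions
        if j > 0:                               # fused terms: smooth neighboring weights
            g += lam2 * sgn(w[j] - w[j - 1])
        if j < L - 1:
            g += lam2 * sgn(w[j] - w[j + 1])
        w[j] -= lr * g
    b -= lr * gb / len(X)

def predict(xi):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5 else 0

acc = sum(predict(xi) == yi for xi, yi in zip(X, y)) / len(y)
print(f"training accuracy: {acc:.2f}")
```

After training, the weights concentrate on the altered region, which is the analogue of locating the probes (and hence genes) driving the class label.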