Knowledge Transfer for Solving the Data Scarcity Problem for Machine Learning in Bioinformatics
Ms. Qian Xu
PhD Thesis Presentation
Many new results are being obtained in bioinformatics through the application of machine learning and data mining, which allow the discovery of useful knowledge from biological datasets and reveal the underlying mechanisms in various genomics and proteomics problems. High-quality machine learning models depend critically on the availability of data, especially for supervised learning techniques. However, in many biological domains, high-quality labeled data are in short supply, partly because of the high cost of data collection experiments in biology. In this thesis, we consider several approaches to alleviating this data scarcity problem. We focus on two fundamental problems in proteomics: protein subcellular localization and protein-protein interaction. Although both problems have been widely studied in bioinformatics, the lack of labeled data makes the application of machine learning solutions to these problems infeasible. We exploit two novel ideas from machine learning to tackle the protein subcellular localization problem. First, we present a semi-supervised learning method that combines a large amount of un-annotated protein data with a small amount of labeled data to build a robust and accurate predictive model. Second, recognizing that common knowledge in different biological datasets can be extracted and shared among different learning tasks, we adopt a multi-task learning method to learn different tasks together. This effectively alleviates the data scarcity problem for any individual task. We apply this method to several protein subcellular localization prediction tasks. For the protein-protein interaction problem, our algorithm successfully borrows and transfers useful knowledge from auxiliary protein-protein interaction networks to our target interaction network. The connection between the two networks reflects the similarities between protein entities and the properties of the network topologies.
We further consider a collective matrix factorization method for inferring the unobserved interactions in a target network with the aid of an auxiliary interaction network. Finally, we consider the quantitative structure-activity relationship inference problem, which is an important problem in in-silico drug design and is closely related to protein subcellular localization and protein-protein interaction. A popular solution is to model chemical compounds as graphs and to exploit different graph kernels to incorporate sequential, structural and chemical information. To avoid designing specific graph kernels, we present a novel method based on graph matching. The idea is motivated by the intuition that, rather than providing manually constructed kernels for graphs, we can use learning to find the underlying similarity functions that best fit the training set. The method thus requires a set of pairs of graphs and their corresponding matching matrices as training data. In practice, however, such a training set is hard to obtain. Instead, we employ a state-of-the-art graph alignment method to generate a set of pairs of graphs and their corresponding matchings, and then use them as ground truth to learn a graph matching function. These solutions are among the first efforts to use knowledge transfer from auxiliary domains to help solve a target learning problem in bioinformatics when labeled data are scarce, thus opening up new opportunities for more effective applications of machine learning and data mining in bioinformatics.
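As an illustration of the collective matrix factorization idea mentioned above, the following is a minimal sketch and not the formulation used in the thesis. It assumes that the target and auxiliary networks are defined over the same proteins, that they share a single latent protein factor matrix, and that a plain squared-error loss is minimized by gradient descent. The function name `collective_mf`, the parameters `alpha`, `reg` and `lr`, and the toy data are all hypothetical.

```python
import numpy as np


def collective_mf(target, mask, auxiliary, rank=2, alpha=0.5, reg=0.01,
                  lr=0.01, n_iters=2000, seed=0):
    """Jointly factorize two interaction matrices with a shared factor U.

    Assumed model (illustrative only):
        target    ~ U @ V_t.T  on entries where mask == 1
        auxiliary ~ U @ V_a.T  on all entries
    Symmetry of the networks is not enforced in this simplified version.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((target.shape[0], rank))
    V_t = 0.1 * rng.standard_normal((target.shape[1], rank))
    V_a = 0.1 * rng.standard_normal((auxiliary.shape[1], rank))
    history = []
    for _ in range(n_iters):
        # Residuals: the target residual is zeroed on unobserved entries.
        err_t = mask * (U @ V_t.T - target)
        err_a = U @ V_a.T - auxiliary
        loss = 0.5 * np.sum(err_t ** 2) + 0.5 * alpha * np.sum(err_a ** 2)
        loss += 0.5 * reg * (np.sum(U ** 2) + np.sum(V_t ** 2) + np.sum(V_a ** 2))
        history.append(loss)
        # Gradients of the loss above with respect to each factor.
        grad_U = err_t @ V_t + alpha * (err_a @ V_a) + reg * U
        grad_V_t = err_t.T @ U + reg * V_t
        grad_V_a = alpha * (err_a.T @ U) + reg * V_a
        U -= lr * grad_U
        V_t -= lr * grad_V_t
        V_a -= lr * grad_V_a
    return U @ V_t.T, history


# Toy data (made up): six proteins forming two groups of three.
block = np.ones((3, 3))
zeros = np.zeros((3, 3))
target = np.block([[block, zeros], [zeros, block]])
auxiliary = target.copy()  # auxiliary network with the same group structure

# Hide two symmetric pairs of target entries to mimic unobserved interactions.
mask = np.ones_like(target)
for i, j in [(0, 2), (0, 5)]:
    mask[i, j] = mask[j, i] = 0.0

scores, history = collective_mf(target, mask, auxiliary)
print("score for hidden pair (0, 2):", scores[0, 2])
print("score for hidden pair (0, 5):", scores[0, 5])
```

The hidden target entries never enter the target loss, so any signal about them has to come through the shared factor `U`, which is also fitted to the auxiliary network. The weight `alpha` controls how strongly the auxiliary network influences the shared factor.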
Last updated: 2 Nov, 2010