
Project

Deepening the Methodology behind Data Integration and Dimensionality Reduction: Application in Life Sciences

High dimensionality and heterogeneity of data pose persistent challenges in
computational biology and chemistry. As data sets grow in size and complexity,
dimensionality reduction and advanced analytics become increasingly important.
Over the past ten years or so, data integration has become an active area of
research in machine learning, bioinformatics and chemoinformatics.
Several dimensionality reduction and data integration methods are currently
available for analyzing and classifying biological data. In the first part of
this thesis, we concentrate on dimensionality reduction techniques such as
the Generalized Eigenvalue Decomposition (GEVD) and Robust Principal
Component Analysis (RPCA). We investigate GEVD in a maximum likelihood
setting, using a technique based on a generalization of the singular value
decomposition (SVD). We elaborate on the similarity between maximum likelihood
estimation via generalized eigenvalue decomposition (MLGEVD) and generalized
ridge regression. This relationship reveals an important mathematical property
of GEVD: one of the two matrices acts as prior information in the model. We
then present GEVD for the integration of microarray and literature
information. Finally, RPCA is applied to a weighted matrix to identify genes
that are differentially expressed in colon cancer.
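
As a rough illustration of the decomposition discussed above, the following sketch solves the generalized eigenvalue problem A v = lambda B v for a symmetric A and a symmetric positive definite B. It is not the thesis's MLGEVD procedure: the function name `gevd` and the synthetic matrices are assumptions made for this example, and how A and B would be built from microarray and literature data is not specified here.

```python
import numpy as np

def gevd(A, B):
    """Solve A v = lambda B v for symmetric A and symmetric positive definite B.

    Hypothetical helper for illustration only: it reduces the problem to a
    standard symmetric eigenproblem through the Cholesky factor of B.
    """
    L = np.linalg.cholesky(B)           # B = L L^T
    L_inv = np.linalg.inv(L)
    C = L_inv @ A @ L_inv.T             # C u = lambda u has the same eigenvalues
    C = (C + C.T) / 2                   # remove round-off asymmetry
    eigvals, U = np.linalg.eigh(C)      # eigenvalues in ascending order
    V = L_inv.T @ U                     # map back: v = L^{-T} u
    return eigvals, V

# Synthetic placeholder matrices (assumption): two covariance-like matrices.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Z = rng.standard_normal((60, 4))
A = X.T @ X / 50
B = Z.T @ Z / 60

eigvals, V = gevd(A, B)
```

The returned eigenvectors are normalized so that V^T B V is the identity, which is one way to see B acting as a reference metric for the decomposition.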
In the second part of the thesis, we propose a data-driven bandwidth selection
criterion for kernel PCA (KPCA), which is a non-linear dimensionality reduction
technique. We center our discussion on feature selection/transformation
techniques in medical diagnostics. We show how to build stable, robust and
interpretable classifiers on non-linear data.
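
To make the role of the bandwidth concrete, here is a minimal kernel PCA sketch with a Gaussian (RBF) kernel. The data-driven selection criterion proposed in the thesis is not stated in this summary, so the sketch only exposes `sigma` as the parameter such a criterion would tune. The function names, the synthetic data and the value of `sigma` are assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """Gaussian kernel matrix exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d2 = np.maximum(d2, 0.0)            # guard against tiny negative round-off
    return np.exp(-d2 / (2 * sigma**2))

def kernel_pca(X, sigma, n_components):
    """Return training-point scores and the variance fraction per component."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, sigma)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                      # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]
    # Projection of training point i on component k is sqrt(lambda_k) * u_k[i].
    scores = top_vecs * np.sqrt(np.maximum(top_vals, 0.0))
    ratio = top_vals / np.trace(Kc)
    return scores, ratio

# Synthetic placeholder data (assumption): 40 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
scores, ratio = kernel_pca(X, sigma=2.0, n_components=2)
```

Rerunning `kernel_pca` with different values of `sigma` changes both the scores and the variance fractions, which is why the bandwidth has to be chosen with care.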
In the third part of the thesis, we investigate a machine learning approach,
a weighted least squares support vector machine (LS-SVM) classifier, to
integrate two data sources. This algorithm offers a single mathematical
framework for data integration and classification, which makes it applicable
to many practical bioinformatics problems.
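
One plausible reading of this idea can be sketched as follows: each data source gets its own kernel matrix, the two are combined with a weight `w`, and a standard LS-SVM classifier is trained on the combined kernel by solving a single linear system. The convex combination, the RBF kernels, the synthetic data and the values of `w`, `gamma` and `sigma` are all assumptions made for illustration; the exact weighting scheme of the thesis is not given in this summary.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel between the rows of A and the rows of B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM classifier system
    [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1],
    with Omega_ij = y_i y_j K_ij and labels y in {-1, +1}."""
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = y
    M[1:, 0] = y
    M[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(M, rhs)
    return sol[1:], sol[0]              # alpha, b

def lssvm_predict(K_new, y_train, alpha, b):
    """Sign of the decision function; K_new has shape (n_new, n_train)."""
    return np.sign(K_new @ (alpha * y_train) + b)

# Synthetic placeholder data (assumption): two views of the same 40 samples.
rng = np.random.default_rng(0)
y = np.concatenate((np.ones(20), -np.ones(20)))
X1 = 2.0 * y[:, None] + rng.standard_normal((40, 3))   # source 1
X2 = 2.0 * y[:, None] + rng.standard_normal((40, 5))   # source 2

w, gamma, sigma = 0.5, 10.0, 3.0        # illustrative values, not tuned
K = w * rbf_kernel(X1, X1, sigma) + (1 - w) * rbf_kernel(X2, X2, sigma)

alpha, b = lssvm_train(K, y, gamma)
pred = lssvm_predict(K, y, alpha, b)    # predictions on the training samples
```

In practice `w` would be tuned, for example by cross-validation, so that the more informative source receives the larger weight.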
Finally, based on PCA, we define new chemical descriptors from the connection
table of chemical compounds. In addition, we develop a new machine learning
approach for the identification of biofilm inhibitors of Salmonella Typhimurium
and Pseudomonas aeruginosa. Here, PCA converts the connection table of
each compound into a structural descriptor of two vectors: one corresponding
to atoms and the other to bonds. As a supervised classification algorithm,
a weighted least squares support vector machine is used, in which a table
enumerating the atoms is weighted against a table enumerating the bonds. We
apply this framework to an experimental data set on the activity of a collection
of compounds against Salmonella and Pseudomonas biofilms. The trained
model predicts the activity of new compounds on these biofilms.

 

Date: 28 Sep 2009 → 25 Apr 2017
Keywords: Machine Learning, Bioinformatics, Chemoinformatics, Data Integration, Dimensionality Reduction
Disciplines: Other engineering and technology
Project type: PhD project