Project

Identifying redundant variables in a simultaneous component analysis.

Across many fields of science, it is increasingly common that (1) data are available on a large number of variables and (2) information about the same set of entities (e.g., genes, persons, or companies) comes from different sources. In bioinformatics, for example, an organism may be exposed to a range of experimental conditions while the intensities of different sets of metabolites are measured, with each set, which may contain many metabolites, obtained with a different method (e.g., gas or liquid chromatography). To disclose the mechanisms underlying such data, the information in the different sources must be analyzed simultaneously. The family of simultaneous component models serves this purpose, because these models extract components from all sources at once. In the bioinformatics example, the components may capture the processes responsible for the biochemical functioning, under different experimental conditions, of the mechanism under study, as measured by the metabolites. A problem often encountered with such data, which may originate from different sources and contain a large number of variables, is that many variables carry redundant information. As a result, the analysis often yields components that are very hard to interpret, so that no deeper insight is gained into the underlying mechanisms these components represent. Moreover, for follow-up research, data sets with redundant variables entail high measurement costs, in both money and time.
The goal of this research project is to develop new methods for simultaneous component analysis that better disclose the mechanisms underlying data sets originating from different sources, by means of a limited set of sparse components (i.e., components on which only a small number of variables load), which are easy to interpret.
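To make the idea of sparse components extracted jointly from several data blocks concrete, the following is a minimal sketch only, not the method developed in this project. It assumes a generic approach: the blocks are concatenated over variables, component scores are kept orthonormal, and loadings are soft-thresholded so that redundant variables receive a loading of exactly zero. The function name `sparse_sca`, the `threshold` parameter, and the toy data are hypothetical and chosen for illustration.

```python
import numpy as np


def sparse_sca(blocks, n_components=1, threshold=0.5, n_iter=50):
    """Illustrative sparse simultaneous component analysis (assumed approach).

    blocks: list of arrays with the same rows (entities) and different columns.
    Returns orthonormal scores T and sparse loadings P.
    """
    # Center each block and join them over variables (same entities in the rows).
    X = np.hstack([b - b.mean(axis=0) for b in blocks])

    # Start from the ordinary (non-sparse) simultaneous component solution.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_components]

    for _ in range(n_iter):
        # Loadings given orthonormal scores, then soft-thresholding:
        # loadings smaller than the threshold become exactly zero.
        P = X.T @ T
        P = np.sign(P) * np.maximum(np.abs(P) - threshold, 0.0)

        # Scores given loadings: orthonormal T closest to X @ P (Procrustes step).
        U2, _, Vt2 = np.linalg.svd(X @ P, full_matrices=False)
        T = U2 @ Vt2
    return T, P


# Toy data (made up): one latent process measured in two blocks;
# only variables 0 and 1 of block A and variable 0 of block B carry signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 1))
block_a = latent @ np.array([[1.0, 1.0, 0.0, 0.0]]) + 0.05 * rng.normal(size=(40, 4))
block_b = latent @ np.array([[1.0, 0.0, 0.0, 0.0, 0.0]]) + 0.05 * rng.normal(size=(40, 5))

T, P = sparse_sca([block_a, block_b], n_components=1, threshold=0.5)
print("variables with non-zero loading:", np.flatnonzero(P[:, 0]))
```

In this sketch, the variables whose loadings are shrunk to zero are the candidates for being dropped in follow-up measurements; how to choose the threshold in practice is left open here.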
Date: 1 Oct 2010 → 30 Sep 2011
Keywords: Data fusion, Variable selection, Simultaneous components, Component analysis, Multi-set data analysis
Disciplines: Applied mathematics in specific fields, Statistics and numerical methods, Applied psychology