
Project

Dimension reduction challenges in multi-block data analysis: Handling outlying variables and performing regression in reduced space

In many research fields, studies measure several variables for different objects and, thus, yield multivariate data. Often, these data further consist of multiple blocks that are coupled in that they share a mode. In this doctoral dissertation, we consider two types of coupled data: column-coupled data in which the objects form multiple blocks and the variables are shared, and row-coupled data in which the objects are shared but the variables consist of two separate sets. To shed light on the relations within such coupled data, we apply dimension reduction techniques. Specifically, we propose extensions of standard principal component analysis. This method decomposes multivariate data into component scores and loadings. The loadings express the relations between the variables and the components and thus play a crucial role in the interpretation of these components.
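To make this decomposition concrete, the following minimal Python sketch (an illustration, not the dissertation's code) obtains component scores and loadings via the singular value decomposition of a column-centered data matrix.

    import numpy as np

    def pca(X, n_components):
        # Principal component analysis: decompose a data matrix (objects x variables)
        # into component scores T and loadings P, so that centered X is approximated by T @ P.T.
        Xc = X - X.mean(axis=0)                        # center each variable
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        T = U[:, :n_components] * s[:n_components]     # component scores (objects x components)
        P = Vt[:n_components].T                        # loadings (variables x components)
        return T, P

The loadings P express how strongly each variable relates to each component, which is why they carry the interpretational burden mentioned above.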

In the first part of the dissertation we focus on column-coupled multi-block data. For such data, it often happens that most variables have similar loadings across the blocks, whereas a few variables behave differently and can therefore be called outlying. Unlike existing methods for column-coupled data, we aim to detect these outlying variables. To this end, we build on and refine the lower bound congruence method (LBCM; De Roover, Timmerman, & Ceulemans, 2017), which assesses the similarity of loadings by means of Tucker's congruence coefficient. LBCM is an interesting heuristic as it yields objective, computation-based information and is therefore perfectly reproducible. Beyond pinpointing the outlying variables, LBCM also ranks all variables according to their relative outlyingness. However, LBCM has three main drawbacks: (1) it is prone to false positives, (2) it uses Tucker's congruence coefficient without checking whether other popular similarity measures are better suited for outlying variable detection, and (3) since it removes the variables in a sequential and heuristic way, no insight is obtained into the correlation structure of the removed variables. We tackle these drawbacks in chapters 1 to 3. In chapter 1, we address the false-positives issue by adding a resampled upper bound (RUB) to LBCM, yielding the LRUBCM method. In chapter 2, we investigate whether the outlyingness ranking of the variables can be improved by plugging in other similarity measures. In chapter 3, we propose the Outlying and Non-outlying Variable (ONVar) model (and algorithm) which, on top of component scores and loadings, includes a partition vector that clusters the variables into outlying and non-outlying sets. Whereas the non-outlying variables are restricted to have the same loadings across the blocks, the outlying ones are modeled with block-specific loadings.
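The similarity measure at the heart of LBCM is Tucker's congruence coefficient, shown in the minimal Python sketch below. The loading vectors a and b are hypothetical stand-ins for the same component's loadings in two blocks; the sketch covers only the coefficient itself, not the full sequential LBCM procedure.

    import numpy as np

    def tucker_congruence(x, y):
        # Tucker's congruence coefficient between two loading vectors:
        # phi = sum(x*y) / sqrt(sum(x^2) * sum(y^2)); phi = 1 means proportional loadings.
        return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

    # Hypothetical example: loadings of one component in two blocks.
    a = np.array([0.8, 0.7, 0.9, 0.1])    # block 1 loadings
    b = np.array([0.7, 0.8, 0.8, -0.6])   # block 2 loadings; the last variable behaves differently
    print(tucker_congruence(a, b))

A low congruence between blocks signals that at least one variable loads differently across them, which is the kind of evidence LBCM accumulates when ranking variables by outlyingness.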

In the second part of the dissertation we focus on row-coupled two-block data. Here, researchers are usually interested in extracting components from both data blocks simultaneously and relating them to each other. We take a regression perspective, in that the two blocks pertain to predictors and criteria. In this second part, we extend principal covariates regression (PCovR). This method simultaneously reduces the predictors to components and regresses the criteria on these components. Although earlier work showed promising simulation results, problems might arise when the number of criterion variables is high and when some of them are not related to the predictors and, thus, cannot be predicted by them. To deal with these problems, chapter 4 presents PCovR2, a new method that extends PCovR by also reducing the criteria to a few criterion components.
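For concreteness, here is a minimal Python sketch of PCovR following its standard closed-form solution (De Jong & Kiers, 1992), in which the component scores are the leading eigenvectors of a weighted combination of the predictor-reconstruction and criterion-prediction parts. The weighting parameter alpha and the helper name pcovr are illustrative assumptions, not the dissertation's implementation, and the additional criterion reduction of PCovR2 is not included.

    import numpy as np

    def pcovr(X, Y, n_components, alpha=0.5):
        # Minimal PCovR sketch: find orthonormal component scores T = X W that minimize
        # alpha*||X - T Px||^2/||X||^2 + (1 - alpha)*||Y - T Py||^2/||Y||^2.
        # X (n x p, predictors) and Y (n x q, criteria) are assumed column-centered.
        HY = X @ np.linalg.pinv(X) @ Y                 # Y projected onto the column space of X
        G = (alpha * X @ X.T / np.sum(X ** 2)
             + (1 - alpha) * HY @ HY.T / np.sum(Y ** 2))
        vals, vecs = np.linalg.eigh(G)                 # eigenvalues in ascending order
        T = vecs[:, ::-1][:, :n_components]            # scores: leading eigenvectors of G
        Px = T.T @ X                                   # predictor loadings: X is approximated by T @ Px
        Py = T.T @ Y                                   # regression weights: Y is predicted by T @ Py
        return T, Px, Py

The parameter alpha governs the trade-off: alpha = 1 recovers ordinary principal component analysis of the predictors, while alpha close to 0 emphasizes prediction of the criteria.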

Date: 1 Oct 2016 → 16 Sep 2020
Keywords: Psychology, Quantitative Psychology
Disciplines: Applied psychology
Project type: PhD project