
Publication

Robust and sparse statistical models for high-dimensional data

Book - Dissertation

High-dimensional data analysis has become an indispensable part of modern statistics. Due to technological advancements, data are collected with an ever-growing number of features. To handle the high dimensionality, we often adopt the sparsity assumption: most of the measured features are non-informative for the data structure we are interested in. This assumption underlies the popular research area of statistical learning with sparsity. Moreover, in practice massively collected data are easily contaminated by outliers, so robust estimators for high-dimensional data are particularly needed. Here, we consider two contamination models: the Tukey-Huber Contamination Model (THCM) and the Independent Contamination Model (ICM). Under THCM, entire cases may be outlying, whereas ICM assumes that each entry of an observation is contaminated independently. Outliers that follow THCM and ICM are called rowwise outliers and cellwise outliers, respectively.

Sparse principal component analysis is used to obtain stable and interpretable principal components from high-dimensional data. To handle potential rowwise outliers, we propose a robust method called Least Trimmed Squares Sparse Principal Component Analysis (LTS-SPCA). The proposed method builds on Least Trimmed Squares Principal Component Analysis (LTS-PCA), which provides robust but non-sparse principal component estimates. To obtain sparse solutions, LTS-SPCA incorporates a regularization penalty on the loading vectors. The principal directions are determined sequentially, to avoid that outliers in the principal component subspace destroy the sparse structure of the loadings. Simulation studies and real data examples show that the new method gives accurate estimates, even when the data are highly contaminated.
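The least-trimmed-squares idea behind LTS-PCA can be sketched as follows. This is a minimal illustration, not the algorithm from the thesis: it alternates between fitting ordinary PCA on the currently kept rows and re-selecting the h rows with the smallest reconstruction error (a concentration step). The function name and all parameters are hypothetical, and a practical implementation would use several random starts.

```python
import numpy as np

def lts_pca_sketch(X, k=1, keep_frac=0.75, n_iter=20, rng=None):
    """Toy illustration of the least-trimmed-squares idea for PCA:
    alternate between (a) fitting PCA on the currently kept subset and
    (b) re-selecting the h rows with smallest reconstruction error."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    h = int(keep_frac * n)                  # number of rows to keep
    keep = rng.choice(n, size=h, replace=False)
    for _ in range(n_iter):
        Xs = X[keep]
        mu = Xs.mean(axis=0)
        # PCA on the trimmed subset via SVD of the centered data
        _, _, Vt = np.linalg.svd(Xs - mu, full_matrices=False)
        V = Vt[:k].T                        # p x k loading matrix
        # rowwise reconstruction errors for ALL observations
        R = (X - mu) - (X - mu) @ V @ V.T
        err = (R ** 2).sum(axis=1)
        keep = np.argsort(err)[:h]          # re-trim: keep h best-fitting rows
    return mu, V, keep
```

Because the trimmed rows do not enter the fit, a minority of rowwise outliers cannot pull the estimated subspace towards themselves; sparsity would then be added on top of this by penalizing the loading vectors.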
Moreover, compared to existing robust sparse PCA methods, the computation time is reduced to a great extent.

Next, we propose a robust variable screening approach, called Robust Factor Profiled Sure Independence Screening (RFPSIS), for ultra-high dimensional regression problems where the data may contain rowwise outliers. The new method builds on the ideas of Sure Independence Screening (SIS) and Factor Profiled Sure Independence Screening (FPSIS). SIS is a fast variable selection procedure for ultra-high dimensional regression analysis. Unfortunately, its performance deteriorates greatly with increasing dependence among the predictors. Factor profiling remedies this by removing the correlations among the predictor variables: the data are projected onto the orthogonal complement of the subspace spanned by a few latent factors. To handle potential outliers, we use LTS-PCA to estimate the latent factors and the factor profiled variables robustly. Variable screening is then performed on the factor profiled variables with the regression MM-estimator. The different types of outliers in this model and their roles in variable screening are studied. Both simulation studies and a real data analysis show that the proposed procedure performs well on clean data and outperforms the two nonrobust methods on contaminated data.

Finally, since the cellwise outlier paradigm can be more suitable for high-dimensional data, we propose a sparse and robust regression estimator called PenSS. It is a penalized version of the shooting-S estimator, a non-sparse robust regression estimator for cellwise outliers. We modify the original shooting-S algorithm based on a new estimator, called the two-stage-MM estimator, which deals better with bad leverage points in a univariate regression model than existing estimators. The PenSS estimator is studied in combination with the Subset, Lasso, SCAD and MCP penalty functions.
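For reference, the univariate thresholding operators commonly associated with the Lasso, SCAD and MCP penalties can be sketched as follows. These are the standard textbook rules for a single standardized coordinate, shown purely for illustration (the Subset penalty, a form of hard thresholding, is omitted); they are not taken from the thesis itself.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso: S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule (Fan & Li, 2001): soft thresholding near
    zero, a linear interpolation region, and no shrinkage for |z| > a*lam."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= 2 * lam,
                    soft_threshold(z, lam),
                    np.where(np.abs(z) <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

def mcp_threshold(z, lam, a=3.0):
    """MCP thresholding rule (Zhang, 2010): rescaled soft thresholding
    for |z| <= a*lam, and no shrinkage beyond that."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= a * lam,
                    soft_threshold(z, lam) / (1.0 - 1.0 / a),
                    z)
```

Unlike the Lasso, the SCAD and MCP rules leave large coefficients unshrunken, which reduces estimation bias on the strong signals.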
A fast algorithm that alternately updates the primal and dual variables is proposed to compute the PenSS estimator. Moreover, a continuation strategy is used to initialize the algorithm along a sequence of decreasing tuning parameters. Simulation studies and a real data analysis show that the proposed estimator works well on both clean and contaminated data. Compared with existing methods, the PenSS estimator has superior performance not only for cellwise outliers but also for rowwise outliers.
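The continuation strategy can be illustrated with a plain coordinate-descent ("shooting") lasso solver: each tuning parameter in a decreasing grid is solved warm-started from the previous solution, so the path of estimates is traced out cheaply. This is a generic sketch with a least-squares loss, not the PenSS algorithm itself (which uses a robust loss and alternating primal/dual updates); all names are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, beta0, n_iter=200):
    """Plain 'shooting' (coordinate descent) for the lasso objective
    (1/2n)||y - X b||^2 + lam * ||b||_1, with X column-standardized."""
    n, p = X.shape
    beta = beta0.copy()
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ beta                       # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]         # remove coordinate j from the fit
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - n * lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]         # add updated coordinate back
    return beta

def lasso_path(X, y, lams):
    """Continuation: solve along a decreasing grid of tuning parameters,
    warm-starting each solve from the previous solution."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lams, reverse=True):
        beta = lasso_cd(X, y, lam, beta)
        path.append((lam, beta.copy()))
    return path
```

Starting from the most heavily penalized (hence sparsest) solution and relaxing the penalty step by step keeps each solve close to its warm start, which is what makes the continuation scheme fast in practice.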
Year of publication: 2019
Accessibility: Open