Project

Robust and sparse statistical methods for actuarial sciences

This PhD thesis consists of two parts. The first part focuses on robust statistics. More specifically, we consider robust regression when the response variable follows a distribution in the double exponential family. By means of a generalized linear model (GLM) based on covariates, we robustly estimate both the expected value and the dispersion. The dispersion can be of interest in its own right, but taking it into account also improves the accuracy of the confidence intervals for the mean. We call this estimator the robust double exponential (RDE) estimator; it is a Fisher-consistent M-estimator. We show how its influence function can be kept bounded and derive its asymptotic distribution. Next, we propose a robust generalized quasi-deviance, which forms the basis of a stable robust test. This test can be used, for example, to check for over- or underdispersion. Simulations for both binomial and Poisson models demonstrate the excellent performance of the RDE estimator and the associated robust tests.
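
For readers less familiar with the terminology, the properties named above can be written out for a generic M-estimator. This is textbook notation chosen here for illustration, not the specific RDE estimating equations of the thesis: the score-type function \psi, the parameter \theta, and the matrices M and Q are assumed symbols.

```latex
% Generic M-estimator: \hat{\theta}_n solves the estimating equation
\[ \sum_{i=1}^{n} \psi(y_i, x_i; \hat{\theta}_n) = 0 . \]

% Fisher consistency: the estimating equation holds in expectation at the model
\[ \mathbb{E}_{F_\theta}\bigl[ \psi(Y, x; \theta) \bigr] = 0 . \]

% Influence function: proportional to \psi, so a bounded \psi gives a bounded influence
\[ \mathrm{IF}(y; \psi, F_\theta) = M^{-1} \psi(y, x; \theta),
   \qquad
   M = -\,\mathbb{E}_{F_\theta}\!\left[ \frac{\partial}{\partial \theta} \psi(Y, x; \theta) \right] . \]

% Asymptotic normality with the usual sandwich covariance
\[ \sqrt{n}\,\bigl(\hat{\theta}_n - \theta\bigr) \xrightarrow{d}
   \mathcal{N}\bigl(0,\; M^{-1} Q M^{-\top}\bigr),
   \qquad
   Q = \mathbb{E}_{F_\theta}\bigl[ \psi(Y, x; \theta)\, \psi(Y, x; \theta)^{\top} \bigr] . \]
```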

We also develop penalized versions of the RDE estimator, both to obtain sparse estimates for high-dimensional data and to obtain flexible estimates via generalized additive models (GAMs). These extensions are based on the weighted least squares representation of the RDE estimator. Finally, real data examples illustrate the importance of robust inference for dispersion effects in GLMs and GAMs.

The second part of this thesis deals with the concordance probability, which is a robust performance measure for a model. It is the probability that a randomly chosen comparable pair of observations, together with their predictions, is a concordant pair. For large data sets, however, naively checking every possible pair takes a very long time. We therefore propose two approximations: the so-called marginal approximation and the k-means approximation. A comprehensive simulation study shows that, for observations that follow a continuous distribution, the k-means approximation has the smallest approximation error and also the shortest computation time. When the observations follow a binary distribution, however, the marginal approximation is both the fastest and the most accurate.
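
To make the cost of the naive computation concrete, the following is a minimal Python sketch of the pairwise definition. It is not taken from the thesis or its R packages. The function name `concordance_probability_naive` is hypothetical, and the sketch assumes that a pair is comparable only when both the observations and the predictions differ, so ties are skipped.

```python
from itertools import combinations


def concordance_probability_naive(observed, predicted):
    """Share of comparable pairs whose predictions are ordered like the observations.

    Assumption of this sketch: a pair is comparable only if its two
    observations differ and its two predictions differ.
    """
    concordant = 0
    discordant = 0
    # Visit every unordered pair once: n * (n - 1) / 2 comparisons in total.
    for (y_i, p_i), (y_j, p_j) in combinations(zip(observed, predicted), 2):
        if y_i == y_j or p_i == p_j:
            continue  # tied pair, not comparable under the assumption above
        if (y_i < y_j) == (p_i < p_j):
            concordant += 1
        else:
            discordant += 1
    comparable = concordant + discordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Usage: 5 of the 6 pairs are concordant; only the pair (2, 3) is discordant.
observed = [1, 2, 3, 4]
predicted = [0.1, 0.4, 0.35, 0.8]
print(concordance_probability_naive(observed, predicted))
```

The double loop hidden in `combinations` is what makes this approach quadratic in the number of observations.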

In the insurance industry, two types of models play an important role in determining the premium of an insurance contract: frequency models predict the expected frequency of claims, while severity models predict how much the average claim will cost. The concordance probability is a very suitable measure of the performance of such models, since we mainly want to distinguish the large from the small risks. We modify the classical concordance probability to take into account the duration of each contract, which plays an important role in the frequency models. Moreover, we introduce the weighted-mean-plot, in which local concordance probabilities are displayed as a function of the duration. The adjusted concordance probability is applied in this thesis to two real data sets from the insurance world.

Because the concordance probability is a probability and therefore lies between 0 and 1, even small deviations from the true value are undesirable. Finally, we therefore propose an efficient algorithm that calculates the concordance probability exactly. This method is based on the well-known merge sort algorithm and runs in linearithmic time, O(n log n), which is a tremendous improvement over the naive, quadratic implementation of the concordance probability. Thanks to the short computation time, the concordance probability can now also be used in the fitness function of a machine learning algorithm.
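
One way a merge-sort-based exact computation can work is sketched below in Python. This is an illustration of the general idea (counting inversions while merging), not the algorithm as implemented in the thesis or its R packages. The function names `count_inversions` and `concordance_probability_fast` are hypothetical, and the sketch assumes that pairs with tied observations or tied predictions are not comparable.

```python
def count_inversions(values):
    """Count pairs i < j with values[i] > values[j] using merge sort."""

    def sort_count(seq):
        if len(seq) <= 1:
            return seq, 0
        mid = len(seq) // 2
        left, inv_left = sort_count(seq[:mid])
        right, inv_right = sort_count(seq[mid:])
        merged, inversions = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] is strictly smaller than every remaining left element.
                merged.append(right[j])
                j += 1
                inversions += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    return sort_count(list(values))[1]


def concordance_probability_fast(observed, predicted):
    """Exact concordance probability in O(n log n), ties excluded (assumption)."""
    pairs = list(zip(observed, predicted))
    # Sort by observation, ties by ascending prediction: strict inversions in
    # the prediction sequence are exactly the discordant pairs.
    by_obs = sorted(pairs)
    discordant = count_inversions([p for _, p in by_obs])
    # Sort by observation, ties by descending prediction: strict inversions in
    # the negated prediction sequence are exactly the concordant pairs.
    by_obs_desc = sorted(pairs, key=lambda t: (t[0], -t[1]))
    concordant = count_inversions([-p for _, p in by_obs_desc])
    comparable = concordant + discordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Usage: same result as checking all 6 pairs by hand (5 concordant, 1 discordant).
print(concordance_probability_fast([1, 2, 3, 4], [0.1, 0.4, 0.35, 0.8]))
```

Sorting within tied observations in the two opposite directions ensures that tied pairs never contribute to either count, so no separate tie correction is needed in this sketch.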

All algorithms mentioned above are available in various R packages, with accompanying manuals, at https://github.com/JolienPonnet.

Date: 3 Oct 2018 → 21 Sep 2022
Keywords: Statistics, Robust
Disciplines: Applied mathematics in specific fields, Statistics and numerical methods, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences
Project type: PhD project