

Towards Declarative Statistical Inference

Wide-ranging digitalization has made it possible to capture ever larger amounts of data. To transform this raw data into meaningful insights, data analytics and statistical inference techniques are essential. However, while a researcher is expected to be an expert in their own field, it is not self-evident that they are also proficient in statistics. In fact, statistical inference is known to be a labor-intensive and error-prone task. This dissertation aims to understand current statistical inference practices for the experimental evaluation of machine learning algorithms, and proposes improvements where possible. It takes a small step towards the goal of automating the data analysis component of empirical research, making the process more robust in terms of correct execution and interpretation of the results.

Our first contribution is a synthesis of existing knowledge about error estimation of supervised learning algorithms with cross-validation. We highlight the distinction between model error and learner error (the error of one particular fitted model versus the expected error of the learning algorithm over training sets of a given size), and investigate the effect of repeating cross-validation on the quality of the error estimate.
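
To make the setting concrete, the following is a minimal sketch of this kind of error estimation using scikit-learn's RepeatedKFold; the dataset, learner, and parameters are illustrative placeholders, not the dissertation's experimental setup:

```python
# Minimal sketch of learner-error estimation with repeated k-fold
# cross-validation (illustrative placeholders, not the dissertation's
# experimental setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Repeating the k-fold partitioning averages out the variance caused
# by any single random split of the data.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# The mean over all folds and repetitions estimates the *learner*
# error (the expected error of models trained on samples of this
# size), not the error of one particular fitted model.
print(f"estimated error: {1 - scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over repetitions reduces the variance due to the random partitioning, but it cannot remove the variance stemming from the finite sample itself, which is why the benefit of repeating is worth investigating.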

Next, we focus on the evaluation of multi-instance learning algorithms. Here, instances are not labeled individually; instead, they are grouped into bags and only the bag label is known. Our second contribution is an investigation of the extent to which conclusions about bag-level performance can be generalized to the instance level. Our third contribution is a meta-learning experiment in which we predict the most suitable multi-instance learner for a given problem.
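
For concreteness, here is a minimal sketch of the multi-instance representation and the standard bag-level prediction rule; the synthetic data and the max-aggregation rule are illustrative assumptions, not the dissertation's method:

```python
# Minimal sketch of the multi-instance setting with synthetic data
# (illustrative only). Each bag is a variable-size set of instances;
# only the bag label is observed.
import numpy as np

rng = np.random.default_rng(0)
bags = [rng.normal(size=(rng.integers(3, 8), 2)) for _ in range(4)]
bag_labels = np.array([0, 1, 0, 1])  # instance labels stay unknown

def predict_bag(instance_scores, threshold=0.5):
    # Under the standard multi-instance assumption, a bag is positive
    # iff at least one of its instances is positive, so a bag-level
    # prediction aggregates instance-level scores with a max.
    return int(instance_scores.max() > threshold)

# Hypothetical instance scores for one bag, e.g. from some classifier.
print(predict_bag(rng.random(len(bags[0]))))
```

The max aggregation illustrates why the generalization question is non-trivial: a learner can classify bags correctly while mis-scoring many individual instances, so good bag-level performance does not automatically imply good instance-level performance.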

The intricate nature of statistical inference raises the question of whether this aspect of research can be automated. One requirement for this is the availability of a model representing all relevant characteristics of the population under study. Bayesian networks are a candidate for this, as they concisely describe the joint probability distribution of a set of random variables and come with a plethora of efficient inference methods. Our last contribution is a theoretical proposal of a greedy hill-climbing structure learning algorithm for Bayesian networks.
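
As an illustration of the general technique (not of the specific algorithm proposed in the dissertation), here is a minimal sketch of score-based greedy hill-climbing over Bayesian network structures, assuming discrete, integer-coded data and the decomposable BIC score; edge-reversal moves are omitted for brevity:

```python
# Minimal sketch of greedy hill-climbing structure learning for a
# discrete Bayesian network (illustrative; not the dissertation's
# proposed algorithm). Data: 2D integer-coded numpy array, one
# column per variable.
import itertools
import math
import numpy as np

def local_bic(data, node, parents):
    # Decomposable BIC contribution of one node given its parents:
    # conditional log-likelihood minus a penalty on free parameters.
    n = data.shape[0]
    states = [np.unique(data[:, v]) for v in range(data.shape[1])]
    configs = list(itertools.product(*[states[p] for p in parents]))
    loglik = 0.0
    for cfg in configs:
        mask = np.ones(n, dtype=bool)
        for p, val in zip(parents, cfg):
            mask &= data[:, p] == val
        n_cfg = int(mask.sum())
        if n_cfg == 0:
            continue
        for s in states[node]:
            n_sc = int((data[mask, node] == s).sum())
            if n_sc:
                loglik += n_sc * math.log(n_sc / n_cfg)
    n_params = (len(states[node]) - 1) * len(configs)
    return loglik - 0.5 * math.log(n) * n_params

def creates_cycle(parents, child, new_parent):
    # Adding new_parent -> child creates a cycle iff child is
    # already an ancestor of new_parent.
    stack = [new_parent]
    while stack:
        v = stack.pop()
        if v == child:
            return True
        stack.extend(parents[v])
    return False

def hill_climb(data):
    n_vars = data.shape[1]
    parents = {v: [] for v in range(n_vars)}
    score = {v: local_bic(data, v, []) for v in range(n_vars)}
    while True:
        best_gain, best_move = 0.0, None
        # Candidate moves: add or delete one edge (reversal omitted).
        for child, parent in itertools.permutations(range(n_vars), 2):
            if parent in parents[child]:
                cand = [p for p in parents[child] if p != parent]
            elif not creates_cycle(parents, child, parent):
                cand = parents[child] + [parent]
            else:
                continue
            # BIC decomposes per node: only the child's score changes.
            gain = local_bic(data, child, cand) - score[child]
            if gain > best_gain:
                best_gain, best_move = gain, (child, cand)
        if best_move is None:
            return parents  # local optimum reached
        child, cand = best_move
        parents[child], score[child] = cand, score[child] + best_gain
```

Each iteration applies the single edge change with the largest score improvement; because the score decomposes per node, only the affected child's local score needs recomputation, which is what makes greedy search over the super-exponential space of structures tractable in practice.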

Date: 7 Sep 2010 → 14 Dec 2017
Keywords: data mining, machine learning
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences
Project type: PhD project