Project
Towards Declarative Statistical Inference
Wide-ranging digitalization has made it possible to capture ever larger amounts of data. To transform this raw data into meaningful insights, data analytics and statistical inference techniques are essential. However, while a researcher is expected to be an expert in their own field, it is not self-evident that they are also proficient in statistics. In fact, statistical inference is known to be a labor-intensive and error-prone task. This dissertation aims to understand current statistical inference practices for the experimental evaluation of machine learning algorithms, and proposes improvements where possible. It takes a small step towards the goal of automating the data analysis component of empirical research, making the process more robust in terms of correct execution and interpretation of the results.
Our first contribution is a synthesis of existing knowledge about error estimation of supervised learning algorithms with cross-validation. We highlight the distinction between model and learner error, and investigate the effect of repeating cross-validation on the quality of the error estimate.
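The distinction above can be made concrete in a small sketch. The code below is not the dissertation's method, only a minimal illustration of repeated k-fold cross-validation: each repetition re-partitions the data into folds, and the per-repetition estimates are averaged. The function names (`k_fold_indices`, `cross_val_error`) and the generic `train_fn`/`err_fn` hooks are assumptions chosen for the example.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle indices and split them into k folds of (near-)equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(X, y, train_fn, err_fn, k=10, repeats=1, seed=0):
    """Estimate the learner error (the expected error of the learning
    algorithm, as opposed to the error of one fitted model) by k-fold
    cross-validation, repeated over `repeats` random fold partitions
    and averaged."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        fold_errors = []
        for fold in k_fold_indices(len(X), k, rng):
            test = set(fold)
            Xtr = [x for i, x in enumerate(X) if i not in test]
            ytr = [v for i, v in enumerate(y) if i not in test]
            model = train_fn(Xtr, ytr)  # a fresh model per fold
            fold_errors.append(
                statistics.mean(err_fn(model, X[i], y[i]) for i in fold))
        estimates.append(statistics.mean(fold_errors))
    return statistics.mean(estimates)
```

Repeating the procedure with different partitions reduces the variance introduced by any single random fold assignment, which is one facet of the estimate quality studied in this contribution.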
Next, we focus on the evaluation of multi-instance learning algorithms. Here, instances are not labeled individually, but instead are grouped together in bags and only the bag label is known. Our second contribution is an investigation of the extent to which conclusions about bag-level performance can be generalized to the instance-level. Our third contribution is a meta-learning experiment in which we predict the most suitable multi-instance learner for a given problem.
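To illustrate the bag/instance distinction, here is a minimal sketch under the standard multi-instance assumption (a bag is positive iff at least one of its instances is positive); this assumption and the helper names `bag_predict` and `bag_error` are choices for the example, not the dissertation's experimental setup.

```python
def bag_predict(instance_clf, bag):
    """Standard multi-instance assumption: a bag is labeled positive
    iff at least one of its instances is predicted positive."""
    return int(any(instance_clf(x) for x in bag))

def bag_error(instance_clf, bags, labels):
    """Fraction of bags whose predicted label disagrees with the bag
    label. Note that a low bag-level error does not by itself imply
    that the instance-level predictions are accurate."""
    preds = [bag_predict(instance_clf, b) for b in bags]
    return sum(p != t for p, t in zip(preds, labels)) / len(bags)
```

The comment in `bag_error` points at the core question of the second contribution: bag-level accuracy can mask instance-level mistakes, since a bag can be classified correctly for the wrong instance-level reasons.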
The intricate nature of statistical inference raises the question of whether this aspect of research can be automated. One requirement for this is the availability of a model representing all relevant characteristics of the population under study. Bayesian networks are a candidate for this, as they concisely describe the joint probability distribution of a set of random variables, and come with a plethora of efficient inference methods. Our last contribution is a theoretical proposal of a greedy hill-climbing structure learning algorithm for Bayesian networks.
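For readers unfamiliar with this family of algorithms, the following is a minimal sketch of generic greedy hill climbing over directed acyclic graphs, not the specific algorithm proposed in the dissertation. It starts from the empty graph and repeatedly applies the single edge addition, deletion, or reversal that most improves a caller-supplied score (in practice a decomposable score such as BIC would be used; here `score` is an arbitrary function of the edge set).

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Check that (nodes, edges) is a DAG via Kahn's topological sort."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def hill_climb(nodes, score):
    """Greedy hill climbing over DAG structures: in each step, apply
    the acyclicity-preserving edge addition, deletion, or reversal
    that most improves score(edges); stop at a local optimum."""
    edges = set()
    best = score(edges)
    improved = True
    while improved:
        improved = False
        candidates = []
        for u, v in permutations(nodes, 2):
            if (u, v) in edges:
                candidates.append(edges - {(u, v)})             # delete
                candidates.append(edges - {(u, v)} | {(v, u)})  # reverse
            else:
                candidates.append(edges | {(u, v)})             # add
        valid = [c for c in candidates if is_acyclic(nodes, c)]
        best_cand = max(valid, key=score, default=None)
        if best_cand is not None and score(best_cand) > best:
            edges, best = best_cand, score(best_cand)
            improved = True
    return edges
```

Being greedy, the procedure only guarantees a local optimum of the score; the quality of the learned structure therefore depends on the score function and, in richer variants, on restarts or tabu lists.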