Project

Building personalised machine learning models in health informatics with limited datasets

Healthcare services are being transformed by technological advancements and the availability of health-related data, from wearable device monitoring to treatment personalisation. Machine learning (ML) has the potential to harness this data by identifying patterns and developing prediction models to assist stakeholders and, ultimately, improve healthcare. The applications of machine learning in healthcare have grown exponentially, from drug discovery to preventative health. Given enough data, machine learning models can accurately predict or classify a disease. ML models can learn from longitudinal data collected over time and make predictions early enough to allow for the implementation of any necessary interventions.

Healthcare data, however, pose challenges that make ML modelling difficult. This dissertation addresses four of the major ones: (i) the limited availability of data, due either to a small data corpus or to the need to predict events in advance; (ii) personalisation, that is, ML models that cater to the individual rather than taking a one-size-fits-all approach; (iii) preserving an individual's privacy while maintaining a required level of performance; and (iv) missing data and how to handle it.

To demonstrate the pervasiveness of these challenges, a variety of healthcare applications are chosen. These applications encompass diverse health monitoring scenarios at an individual or institutional level. Modelling weight gain in pregnant women over the course of pregnancy, to ensure a healthy pregnancy and postpartum life, is an example of out-of-hospital preventative health monitoring. In a hospital setting, we explore the prediction of cognitive decline in Alzheimer's patients using a longitudinal dataset comprising various data sources. From the perspective of population health management, we investigate the prediction of infant mortality in a developing country. Finally, we attempt to model the pain experienced over time by individuals performing repetitive tasks at work. Most of these use cases require early prediction so that essential interventions can be carried out in time. It is therefore imperative to develop machine learning models that can learn from only a few measurements of an individual.

The research developed in this thesis aims to address four research questions: (1) Can we predict a patient's health state with limited patient-specific time-series data? (2) Can we detect infant mortality using structured tabular data with a very high percentage of missing values? (3) Can we create personalised machine learning models that adapt over time to generate accurate predictions from few data points? (4) Can we create an ML technique that protects sensitive raw data without sacrificing performance? These broad research questions are further subdivided into application-specific sub-objectives. To address these research questions and sub-objectives, we developed a number of techniques that can handle both N-dimensional time-series data and tabular data.

First, we propose a straightforward method for overcoming the limited availability of individual data. The underlying principle is to learn a non-person-specific ML model from all available individuals and then personalise it with the target user's available data.
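The principle of learning a population model and then personalising it can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the linear (ridge) model, the synthetic data, the function names `fit_population` and `personalise`, and the `strength` parameter are hypothetical and are not taken from the thesis. Personalisation is done by refitting on the user's few points while shrinking the weights towards the population weights.

```python
import numpy as np

def fit_population(X, y, ridge=1e-3):
    """Ridge regression on data pooled from all individuals (assumed model)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def personalise(w_pop, X_user, y_user, strength=1.0):
    """Refit on the target user's data, shrinking towards the population weights.

    Minimises ||X_user w - y_user||^2 + strength * ||w - w_pop||^2.
    With no user data the result is exactly w_pop.
    """
    d = X_user.shape[1]
    lhs = X_user.T @ X_user + strength * np.eye(d)
    rhs = X_user.T @ y_user + strength * w_pop
    return np.linalg.solve(lhs, rhs)

# Synthetic population: many individuals, pooled into one dataset.
rng = np.random.default_rng(0)
w_pop_true = np.array([1.0, -2.0])
X_pop = rng.normal(size=(500, 2))
y_pop = X_pop @ w_pop_true + 0.1 * rng.normal(size=500)
w_pop = fit_population(X_pop, y_pop)

# Target user: only five measurements, slightly different behaviour.
w_user_true = np.array([1.5, -2.0])
X_user = rng.normal(size=(5, 2))
y_user = X_user @ w_user_true + 0.1 * rng.normal(size=5)
w_user = personalise(w_pop, X_user, y_user)
```

In this sketch, `strength` controls how far the personalised model may move from the population model: a large value keeps it close to `w_pop`, while a small value trusts the user's few points more.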

Second, we propose a more complex method that follows a similar principle and adds a localised method for generating informative priors before learning a regression model. The localised method selects individuals from the training data whose health history is similar to that of the individual of interest. A Gaussian-process-based method then learns from the selected subset.
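One possible reading of this two-step idea is sketched below; it is not the thesis's actual method. The assumptions are: similarity is Euclidean distance over the target's early measurements, the informative prior is the mean trajectory of the selected neighbours, and the regression is a plain Gaussian process with an RBF kernel fitted to the residuals. The names `select_similar`, `rbf` and `gp_predict`, the kernel settings and the toy trajectories are all hypothetical.

```python
import numpy as np

def select_similar(train, target_obs, k):
    """Indices of the k training trajectories closest to the target's early points."""
    n_obs = len(target_obs)
    dists = np.linalg.norm(train[:, :n_obs] - target_obs, axis=1)
    return np.argsort(dists)[:k]

def rbf(a, b, length=2.0, var=1.0):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    diff = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (diff / length) ** 2)

def gp_predict(t_obs, y_obs, t_new, prior_mean, noise=1e-2):
    """GP posterior mean at t_new, using prior_mean(t) as the prior mean function."""
    K = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    K_star = rbf(t_new, t_obs)
    resid = y_obs - prior_mean(t_obs)
    return prior_mean(t_new) + K_star @ np.linalg.solve(K, resid)

# Toy training set: six individuals measured on a shared time grid.
t = np.arange(10.0)
slopes = np.array([1.0, 1.1, 0.9, 3.0, 3.1, 2.9])
train = slopes[:, None] * t

# Target individual: only the first three measurements are available.
target_obs = 2.8 * t[:3]

# Step 1: localise by keeping the individuals with the most similar history.
idx = select_similar(train, target_obs, k=3)

# Step 2: use the neighbours' mean trajectory as the prior mean, then
# let the GP correct it with the target's own observations.
neighbour_mean = train[idx].mean(axis=0)
prior_mean = lambda s: np.interp(s, t, neighbour_mean)
pred = gp_predict(t[:3], target_obs, t[3:], prior_mean)
```

When the target's observations agree exactly with the prior mean, the residuals are zero and the prediction reduces to the neighbours' mean trajectory; otherwise the GP shifts the forecast towards the target's own data.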

Third, we offer a privacy-preserving learning paradigm based on aggregating ML models that are each learned from an individual's data. This strategy differs from the conventional centralised technique, in which raw data are collected and sent to a central server. The findings of this study demonstrate that a good privacy-performance trade-off is feasible.
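The contrast with centralised learning can be sketched as follows. This is an illustration under stated assumptions, not the aggregation rule used in the thesis: each individual fits a small ridge model locally, only the parameter vector and the sample count leave the device, and the server combines them with a sample-size-weighted average in the style of federated averaging. The names `local_fit` and `aggregate` and the synthetic data are hypothetical.

```python
import numpy as np

def local_fit(X, y, ridge=1e-3):
    """Fit a ridge model on one individual's data; raw data never leaves this call."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def aggregate(weights, counts):
    """Sample-size-weighted average of the local parameter vectors."""
    weights = np.asarray(weights, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (counts[:, None] * weights).sum(axis=0) / counts.sum()

rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])

local_models, counts = [], []
for n in (20, 35, 50):  # three individuals with different amounts of data
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    local_models.append(local_fit(X, y))  # only parameters are shared
    counts.append(n)

# The server sees parameter vectors and counts, never X or y.
w_global = aggregate(local_models, counts)
```

Sharing parameters instead of raw records is only one ingredient of privacy preservation; whether it suffices depends on the threat model, which this sketch does not address.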

Finally, through a case study, we discuss the flaws in handling missing data with off-the-shelf techniques. We demonstrate the importance of identifying the mechanisms through which data can go missing, as well as the inconsistencies that can creep into the model if these mechanisms are not properly studied during the exploratory phase. The findings of this case study suggest a technique for detecting biased features, which, if not handled carefully, can give the ML model a false sense of predictive power. In conclusion, the concepts presented in this doctoral dissertation are relevant to addressing difficulties in modelling healthcare-related tasks using machine learning.

Date: 22 Jun 2018 → 13 Jan 2023
Keywords: Activity recognition, Health Estimation, Smart Living, Internet of Things, Machine Learning
Disciplines: Sensors, biosensors and smart sensors, Other electrical and electronic engineering
Project type: PhD project