Project

Clustering Time Series

The notion of similarity is becoming more and more important in data analysis.
Comparing two objects quantitatively is a fundamental building block of most
machine learning algorithms, be it supervised or unsupervised. While for simple
mathematical objects, the notion of similarity might be intuitively clear (when
comparing how different two integers are, for example, we can just look at
the absolute difference between them), the situation becomes very cloudy very
quickly as we look at more complex data types. In this thesis, we focus on data
evolving throughout time, known as signals or time series.
The past few decades have seen an incredible increase in time series data.
Larger data storage capacity along with new technological solutions to industrial problems has led to more and more data being generated in industrial contexts,
as exemplified by large equipment in a busy factory having all sorts of sensors
monitored by a control room engineer, or a wind turbine recording information on
wind speed and electricity output, as well as internal data such as temperatures
of different components.
At the level of nationwide, or even cross-border, structures such as the electricity
grid, the storage, processing and transmission of information are crucial to making
the different parts work together. This has given rise to things like smart meters, which monitor the energy usage of every household on a very
fine-grained scale, or the European transnational electricity network, where
data for all the different power lines is necessary to clear the international
energy market. Increasingly, individuals are also equipping themselves
with different sensors, generating terabytes of detailed data on their movement,
interactions, and even medical parameters (blood pressure, heart rate, ...).
Given these evolutions (and some of them will return later on in the text
as illustrations and applications of the theoretical work done in this research
project), it is evident that similarity measures for time series are a timely
and relevant topic. It is also an inherently challenging one: the right distance measure depends on the specific application and what we are trying
to accomplish. Traditional vector distances, which compare the shapes of two
signals, or more statistical measures, which compare the distribution of data points
in the series, can be sensible, but even very simple tasks, such as discerning a sine
wave from a Gaussian noise signal, can be deceptively hard, as we show later on
in the text. There is thus an ever-increasing need for similarity measures that
base their quantitative comparison on interesting, sensible and interpretable
aspects of the time series.
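As a small illustration of how a shape-based vector distance can mislead (a sketch that is not taken from the thesis; the signal length, frequency, noise level, and all names are assumptions chosen for the example), the following compares a sine wave with a copy of itself in opposite phase and with unit-variance Gaussian noise, using the plain Euclidean distance:

```python
import numpy as np


def euclidean_distance(x, y):
    """Plain Euclidean (L2) distance between two equally long signals."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 1000                        # number of samples (assumption)
t = np.linspace(0.0, 1.0, n, endpoint=False)

sine = np.sin(2 * np.pi * 5 * t)                   # 5 full periods
sine_shifted = np.sin(2 * np.pi * 5 * t + np.pi)   # same sine, opposite phase
noise = rng.standard_normal(n)                     # unit-variance Gaussian noise

d_sine_vs_shifted = euclidean_distance(sine, sine_shifted)
d_sine_vs_noise = euclidean_distance(sine, noise)

print(f"sine vs phase-shifted sine: {d_sine_vs_shifted:.2f}")
print(f"sine vs Gaussian noise:     {d_sine_vs_noise:.2f}")
```

Under these assumptions the two sines, which share exactly the same dynamics, end up further apart than the sine and the noise, because the Euclidean distance compares the signals sample by sample rather than by how they were generated.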
This need is well-understood in fields of engineering such as signal processing
and control systems theory, where a signal is seen as being produced by some
underlying dynamical system. The system is the fundamental object of interest,
and the signal is just a particular (often noisy) realization that allows us to
interact with the underlying system. While this readily allows interpretation and
comparison of different dynamics in any way we please, explicitly estimating or
finding mathematical models for these underlying dynamics is often an arduous,
resource-intensive, and computationally expensive task.
In this thesis we investigate a distance measure, known as the weighted cepstral
distance measure, that can be calculated from the time series data alone, yet
can be interpreted in terms of the underlying dynamics of the signal. In this way,
we can quantify similarity between two different dynamical systems, without
having to explicitly construct mathematical models for them and estimate
their model parameters. In its original formulation, this measure made some
restrictive assumptions about the underlying dynamics of the two signals being
compared. The work presented here relaxes these assumptions, allowing for
general linear time-invariant single-input single-output (LTI SISO) systems with
stochastic or deterministic inputs, and makes progress towards
multiple-input multiple-output (MIMO) systems by generalizing some of the
theoretical notions used by the distance measure.
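To give a feel for how such a distance can be computed from data alone, here is a minimal sketch. It assumes the commonly used form of the weighted cepstral distance, d^2 = sum over k >= 1 of k * (c_x(k) - c_y(k))^2, where c_x and c_y are the power cepstra of the two signals. The spectral estimator (averaged windowed periodograms), the truncation at `n_coeffs` coefficients, the AR(1) test systems, and all function names are assumptions made for this illustration, not the implementation used in the thesis.

```python
import numpy as np


def power_cepstrum(x, n_segments=16):
    """Estimate the power cepstrum: inverse FFT of the log power spectrum.

    The spectrum is estimated by averaging periodograms of non-overlapping,
    Hann-windowed segments (an assumption; other estimators are possible).
    """
    x = np.asarray(x, dtype=float)
    seg_len = len(x) // n_segments
    segments = x[: seg_len * n_segments].reshape(n_segments, seg_len)
    segments = segments - segments.mean(axis=1, keepdims=True)
    window = np.hanning(seg_len)
    periodograms = np.abs(np.fft.fft(segments * window, axis=1)) ** 2
    spectrum = np.maximum(periodograms.mean(axis=0), 1e-12)  # avoid log(0)
    return np.real(np.fft.ifft(np.log(spectrum)))


def weighted_cepstral_distance(x, y, n_coeffs=20):
    """Squared weighted cepstral distance, truncated to n_coeffs terms."""
    cx = power_cepstrum(x)
    cy = power_cepstrum(y)
    k = np.arange(1, n_coeffs + 1)
    diff = cx[1 : n_coeffs + 1] - cy[1 : n_coeffs + 1]
    return float(np.sum(k * diff ** 2))


def simulate_ar1(a, n, rng):
    """Hypothetical test system: y[t] = a * y[t-1] + e[t], e white Gaussian."""
    e = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = a * y[t - 1] + e[t]
    return y


rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily
same_1 = simulate_ar1(0.9, 4096, rng)   # system A, first realization
same_2 = simulate_ar1(0.9, 4096, rng)   # system A, second realization
other = simulate_ar1(-0.5, 4096, rng)   # system B, different dynamics

print("same system:     ", weighted_cepstral_distance(same_1, same_2))
print("different system:", weighted_cepstral_distance(same_1, other))
```

In this sketch, two realizations of the same system should come out much closer than realizations of different systems, even though no model is fitted to either signal. The zeroth cepstral coefficient is left out of the sum, so the overall gain of a signal does not affect the result.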
We develop a clustering framework based on the extended versions of the
weighted cepstral distance measure, and provide interpretations for the cluster
centers and the variance within a cluster, in terms of the underlying dynamics.
We show that this clustering algorithm solves the problem of time series
subsequence clustering, a notoriously hard challenge in time series research.
The theoretical notions, claims, and techniques are amply tested and shown to
work in many real-life applications, the most ambitious one being an industrial
project in collaboration with Électricité de France (EDF), a multinational
utility company, on the clearance of international energy markets.
In doing so, we exhibit the applicability of the techniques developed here,
and show that the work presented contributes meaningfully to both academic
research on time series analysis, and to industrial solutions for real-life problems.

Date: 2 Sep 2014 → 14 Jan 2021
Keywords: clustering, time series
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences, Modelling, Biological system engineering, Signal processing, Control systems, robotics and automation, Design theories and methods, Mechatronics and robotics, Computer theory
Project type: PhD project