Project

Automating Data Wrangling

Data is everywhere, and being able to analyse and interpret it efficiently allows companies to, among other things, save money and increase productivity. Existing tools allow users with varying skill levels to perform complex tasks, such as clustering, predictive modelling and forecasting, without writing code, by automatically selecting and configuring suitable algorithms for each task. These tools, however, assume that the data is in an appropriate shape: a single table in which all values are appropriately formatted. This assumption is often violated in practice, as a lot of data is stored in semi-structured formats, such as spreadsheets, and is formatted for human readability rather than for consumption by algorithms. Getting this data into a suitable shape is therefore time-consuming, even for experienced data scientists, with up to 80% of their time spent on data preparation tasks such as restructuring tables and formatting values. Existing tools that aim to reduce the effort of transforming such data into appropriate formats, however, make the strong assumption that users already know exactly what the final data should look like, in terms of both layout and formatting. The barrier for non-experts to get started with analysis thus remains high, and even experts still spend most of their time on preparation steps when presented with ill-formatted data. As part of the ERC Advanced Grant SYNTH project, the team of Professor Luc De Raedt has already shown that predictive approaches are promising for suggesting effective layouts for tables and effective formats for values. The goal of this project is to extend the research prototypes with the functionality required for practical use, an effort guided by a number of use cases, and to have all components in place for building a minimum viable product and pursuing a spin-off based on this technology.
With our solution, we thus aim to close the gap between having access to data and quickly being able to generate value from it, by allowing users to focus on its content rather than on its structure and format.
Date: 1 Jan 2022 → 31 Dec 2023
Keywords: artificial intelligence, machine learning, data wrangling
Disciplines: Data mining