Publication

Program Synthesis for Automated Data Wrangling

Book - Dissertation

Abstract: Data is everywhere, but properly formatted data is not. Anyone working with data has to go through the process of formatting it correctly, and different downstream tasks require different notions of correct formatting. Users therefore have to know what the target format is and how to perform the transformations required to get there, neither of which is always straightforward. This is often a cumbersome task, especially for novice users in environments that allow a lot of flexibility in formatting, such as spreadsheets. The goal of this thesis is to develop algorithms that help the user perform these formatting, or data wrangling, tasks. First, we use program synthesis to automatically reshape tabular data into attribute-value format, and allow users to provide feedback by coloring cells. Second, we use program synthesis to automatically extract relevant features for supervised machine learning, and use performance on these machine learning tasks to drive the program synthesis. Finally, we leverage large language models to enable semantic transformations within string transformation tasks.
Publication year: 2023
Accessibility: Open