Project

Distributional Semantics meets Visual Analytics

So-called token-level distributional semantic models, known in Computational Linguistics as word sense induction and presented by Schütze (1998), can be used for usage-based lexical semantic corpus studies (Geeraerts 2010, pp. 173-179). To use them for that purpose, however, we first need more insight into how these black-box models operate and how they should be calibrated to address more theoretically grounded (socio)linguistic questions.

This research project focusses on two approaches to gaining these insights into Schütze’s (1998) established yet intuitive bag-of-words distributional model. The first is a statistical approach that makes it possible to evaluate a large number of different parameter settings (and consequently models) against manually assigned sense labels. This manual labelling should be regarded as a stepping stone towards fully unsupervised modelling for lexical semantic research questions. The idea is that the semantic distances computed by each model are evaluated in terms of how well tokens of the same class can be separated from the rest. Subsequently, this “separability index” is entered as the response variable in a linear mixed-effects regression model, with the parameter settings of the distributional model as the predictors.
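The project description does not define the separability index itself. As a rough illustration only, the sketch below assumes one possible operationalisation: for each token, the share of its k nearest neighbours (by cosine distance) that carry the same manual sense label, averaged over all tokens. The function names, the choice of cosine distance, the value of k and the toy vectors are all assumptions made for the example, not the project's actual method.

```python
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def separability_index(vectors, labels, k=2):
    """Hypothetical index: mean share of each token's k nearest
    neighbours that have the same sense label as the token itself."""
    n = len(vectors)
    shares = []
    for i in range(n):
        # Rank all other tokens by their distance to token i.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine_distance(vectors[i], vectors[j]))
        nearest = others[:k]
        same = sum(labels[j] == labels[i] for j in nearest)
        shares.append(same / len(nearest))
    return sum(shares) / n


# Toy token vectors (invented): two clearly distinct groups of three tokens.
token_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.0, 0.2, 1.0],
]
sense_labels = ["A", "A", "A", "B", "B", "B"]

# Labels that follow the groups should score higher than labels that cut across them.
score_matching = separability_index(token_vectors, sense_labels, k=2)
score_mixed = separability_index(
    token_vectors, ["A", "B", "A", "B", "A", "B"], k=2
)
print(score_matching, score_mixed)
```

In a setup like the one described, one such score would be computed per parameter setting, and the resulting table of scores and settings would then serve as input to the regression model.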

The second approach applies visual analytics as a complement to the statistical modelling of the parameter space. By visualizing the semantic space in multilayered interactive scatter plots, we give the expert user access to an approximate 2D representation of the high-dimensional similarity matrix produced by each individual model. Because the tool supports visual encoding of the variables, complex interactions and selections, the models can be compared visually in a so-called scatterplot matrix. The visualization tool can be used in different ways, from eyeballing which models are appropriate for a specific semantic phenomenon (e.g. polysemy) to drilling down to the most detailed level for an error analysis of individual, misrepresented tokens.
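The description does not say which dimension-reduction technique produces the 2D representation. Purely as an illustration, the sketch below assumes classical (Torgerson) multidimensional scaling applied to a token-by-token distance matrix, with the two leading eigenvectors found by power iteration. The function names and toy data are invented for the example, and the approach assumes distances that are at least approximately Euclidean.

```python
import random
from math import sqrt


def classical_mds_2d(dist, iterations=500, seed=0):
    """Project a symmetric distance matrix to 2D with classical MDS."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Power iteration converges to the dominant eigenvector of b.
        vec = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iterations):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the matching eigenvalue.
        eigval = sum(vec[i] * sum(b[i][j] * vec[j] for j in range(n))
                     for i in range(n))
        scale = sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = vec[i] * scale
        # Deflate so the next pass finds the second eigenvector.
        b = [[b[i][j] - eigval * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


def euclidean(p, q):
    return sqrt(sum((a - c) ** 2 for a, c in zip(p, q)))


# Toy stand-in for a token distance matrix (invented). The points are planar,
# so a 2D layout can reproduce the distances; real model output generally
# cannot be reproduced exactly, hence "approximate" in the text above.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
toy_dist = [[euclidean(p, q) for q in points] for p in points]
layout = classical_mds_2d(toy_dist)
```

Each row of `layout` is an (x, y) pair that could be drawn as one point in a scatter plot. Repeating this for every model would give the panels of a scatterplot matrix.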

References

Geeraerts, Dirk. Theories of lexical semantics. Oxford University Press, 2010.

Schütze, Hinrich. "Automatic word sense discrimination." Computational Linguistics 24.1 (1998): 97-123.

Date: 15 Mar 2012 → 16 May 2022
Keywords: lexical semantics, corpus linguistics, distributional semantics
Disciplines: Theory and methodology of literary studies
Project type: PhD project