Publication

Token-based vector space models as semantic control in lexical lectometry

Book - Dissertation

Type-based distributional semantics as embodied in vector space models has proven to be a successful method for the retrieval of near-synonyms in large corpora. These words have then been used as variants of lexical sociolinguistic variables (e.g. return and winst for the concept profit) in lectometric studies, that is, the aggregate-level study of lexical distances between linguistic varieties, in particular pluricentric languages such as Dutch. However, a limitation of type-based vector space models is that all senses of a word are lumped together into one vector representation, making it harder to control for polysemy and subtle contextual distinctions. In addition, because they operate at the lexeme level, type-based vector space models cannot pick out the relevant corpus occurrences that serve as the input for the lectometric distance calculations. The main goal of this PhD project is to introduce token-based vector space models into lexical lectometric research, in order to gain better semantic control during the composition of lexical variables. Token-based vector space models address these shortcomings by disambiguating the different senses of lexical variants. This technique models the semantics of individual tokens (i.e. 'usage occurrences') of a word in a corpus and represents them as token clouds in multidimensional vector space, with clusters of tokens revealing distinct senses of the word. By superimposing the token clouds of the lexical variants, one can distinguish which meanings are shared by near-synonyms and determine the 'semantic envelope of variation' of the lexical alternation. For instance, the variant return is polysemous in Netherlandic Dutch, with the two readings 'profit' and 'return game', but not in Belgian Dutch, where only the 'profit' sense is found. By isolating the cluster of tokens with the meaning 'profit', one can identify the near-synonymous tokens of the variants winst and return.
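The workflow described above can be illustrated with a minimal sketch. In practice the token vectors would come from a distributional model trained on corpus data and the sense clusters from a clustering algorithm; here both the 2D coordinates and the sense centroids are hypothetical toy values, used only to show how isolating a shared sense cluster picks out the near-synonymous tokens of return and winst.

```python
# Toy sketch of token-cloud sense isolation (all data hypothetical).
# Each corpus occurrence (token) of a variant gets a vector; tokens are
# assigned to the nearest sense cluster, and only the shared 'profit'
# cluster feeds the lectometric distance calculation.

import math
from collections import defaultdict

# (lexeme, token_id) -> toy 2D "token embedding"
tokens = {
    ("return", "nl_1"): (1.0, 0.9),   # 'profit' reading (Netherlandic Dutch)
    ("return", "nl_2"): (0.9, 1.1),   # 'profit'
    ("return", "nl_3"): (5.0, 4.8),   # 'return game' reading
    ("return", "nl_4"): (5.2, 5.1),   # 'return game'
    ("winst",  "be_1"): (1.1, 1.0),   # 'profit' (Belgian Dutch)
    ("winst",  "be_2"): (0.8, 0.8),   # 'profit'
}

# Assumed sense centroids; in practice these would be found by
# clustering the superimposed token clouds of both variants.
centroids = {"profit": (1.0, 1.0), "return_game": (5.0, 5.0)}

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Assign every token to its nearest sense cluster.
clusters = defaultdict(list)
for key, vec in tokens.items():
    sense = min(centroids, key=lambda s: dist(vec, centroids[s]))
    clusters[sense].append(key)

# Tokens in the shared 'profit' cluster are the near-synonymous
# occurrences of the lexical variable.
profit_tokens = sorted(clusters["profit"])
print(profit_tokens)
```

Note that the 'return game' cluster contains only return tokens, mirroring the observation that this sense is absent for winst, while the 'profit' cluster mixes both variants and thus defines the envelope of variation.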
The fine-tuning of vector space model-based lectometry targeted in this PhD contributes to the scaling up of lexical variationist research by providing methods for dealing with corpora whose size exceeds manual analysis. At the same time, token-based models meet the need for detailed analysis by allowing researchers to zoom in on the behavior of individual tokens and so uncover more subtle contextual distinctions. This PhD project is situated in a larger research endeavor ("Nephological Semantics - Using token clouds for meaning detection in variationist linguistics", BOF C1 project 3H150305) that aims at a detailed understanding of token-based vector representations for lexical, semantic and variational research.
Publication year: 2019

Accessibility: Open