
Publication

Addressing Limitations of Language Models

Book - Dissertation

Applications that automatically process language and/or speech are numerous, including, but not limited to: Automatic Speech Recognition (ASR), Machine Translation, Speech Translation, Spelling Correction, Natural Language Understanding and Natural Language Generation. Every application typically needs a special-purpose system and datasets with task-specific labels, but there is one common ground between all these applications: the Language Model (LM). An LM can tell which word sequences are likely in a certain language and domain and which are not, and it can predict the most likely word(s) following a specific context. This is an important property for a system that automatically processes language, since it helps to disambiguate the intrinsic ambiguities of language. Progress in language modeling can thus lead to progress in many other areas as well. The goal of this PhD thesis is to address limitations of existing LMs, focusing on two of the most widely used paradigms: n-gram LMs and neural LMs.

One of the main issues in language modeling is data sparsity: the number of words and valid word combinations is in principle infinite, and most language modeling paradigms can only give reliable probability estimates for frequent combinations. N-gram LMs, which make the simplifying assumption that the next word can be predicted based solely on the n – 1 previous words, suffer severely from this problem, since they represent words as discrete units and as such cannot represent similarities between words. We propose a method to alleviate this problem by generating new n-grams based on valid syntactic and morphological transformations of existing n-grams. We exploit the fact that in Dutch the word order in head clauses and subordinate clauses differs, and the fact that the conjugations of most verbs can be derived with simple rules. The LMs trained on the expanded data are tested both intrinsically, by evaluating them on text, and in a downstream application, namely ASR.

While n-gram LMs are still preferred in some applications because of their low latency and low computational cost in evaluation, neural LMs are the current state of the art in language modeling accuracy. Neural LMs suffer less from data sparsity because they represent words as continuous vectors, and similar words have vectors that are close to each other because they occur in similar contexts. However, if words share morphological (e.g. the same suffix) or orthographic (e.g. starting with a capital) properties, these properties cannot automatically be derived from their vector representations, even though they can be derived easily from the surface form of the word. We therefore propose to combine the word vector with vectors of the characters in the word to explicitly encode this type of information. Our character-word neural LM improves over a word-level LM, is better at modeling out-of-vocabulary words since the characters provide additional information, and has fewer trainable parameters, which reduces the size of the model.

Neural LMs, and more specifically Recurrent Neural Network (RNN) LMs, have a longer memory than n-gram LMs, because their recurrent connection feeds not only the current input word to the network but also the state of the previous time step, which can in principle encode the entire previous context.
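As a rough illustration of this recurrence (a minimal sketch with placeholder dimensions and random weights, not code from the thesis), a simple RNN updates its state by mixing the current word vector with the state carried over from the previous time step:

```python
# Minimal sketch of a simple (Elman-style) RNN state update.
# All dimensions, weights and the toy input sequence are illustrative assumptions.
import numpy as np

embed_dim, hidden_dim = 16, 32
rng = np.random.default_rng(0)
W_x = rng.standard_normal((hidden_dim, embed_dim)) * 0.1   # input-to-state weights
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # recurrent (state-to-state) weights
b = np.zeros(hidden_dim)

def step(h_prev, x_t):
    """One time step: the new state depends on the current word vector and the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)                            # initial state
for x_t in rng.standard_normal((5, embed_dim)):     # a toy sequence of 5 word vectors
    h = step(h, x_t)                                # h now summarizes the whole prefix
```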
In practice, however, the memory is limited, as shown by the fact that combining the LM with a cache model, which keeps track of a limited number of previous words, still improves the performance of the LM. We make an extensive comparison of two types of cache models, evaluating the combination of the LM with the cache both intrinsically and in ASR, and show that the more advanced cache model does not always lead to the best performance in ASR. Based on the intuition that the cache is most informative for (infrequent) content words, we propose a novel method of combining the LM and cache probabilities, giving relatively more weight to the cache for content words. Additionally, we propose to add only content words to the cache, such that no space is wasted on function words and the cache effectively has a longer memory.

Finally, we address a disadvantage of RNN LMs that is shared by all applications involving Neural Networks (NNs), namely the fact that they are hard to interpret. We present a novel method to analyze memory in RNN models: we calculate state gradients to measure the influence of the input on the state of the RNN. The gradient matrix is decomposed with Singular Value Decomposition (SVD) to investigate which directions of the input space influence the state space and to what extent. The advantage of our framework is that we can analyze how well properties encoded in the embedding space are remembered by the RNN, without knowing beforehand what those properties are. We demonstrate the effectiveness of our method on a synthetic dataset and investigate the influence of several hyperparameters, such as the type and size of the network and the training data, on the memory of LMs trained on natural language data. We also propose two methods to examine how well a specific (syntactic) property is remembered by the LM, and we show that LMs exhibit certain linguistic intuitions.
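As a rough illustration of this kind of state-gradient analysis (not the thesis implementation; the network is randomly initialized and all sizes are toy assumptions), the following PyTorch sketch computes the Jacobian of a final RNN state with respect to the embedding of an earlier input word and inspects its singular values:

```python
# Sketch: gradient of the last hidden state w.r.t. the first input embedding,
# followed by an SVD of the resulting Jacobian. Sizes and inputs are illustrative.
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 16, 32, 100
embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 10))   # toy sequence of word ids
emb = embedding(tokens)                          # (1, 10, embed_dim)
output, _ = rnn(emb)
state = output[0, -1]                            # hidden state at the last time step

# Build the Jacobian d(state) / d(embedding of the first word), row by row.
rows = []
for i in range(hidden_dim):
    grad = torch.autograd.grad(state[i], emb, retain_graph=True)[0]
    rows.append(grad[0, 0])                      # gradient w.r.t. the first input word
jacobian = torch.stack(rows)                     # (hidden_dim, embed_dim)

# The singular values indicate how strongly, and along which input directions,
# the first word still influences the final state.
U, S, Vh = torch.linalg.svd(jacobian)
print(S)
```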
Year of publication: 2019
Accessibility: Open