
Project

Deep learning for sound source localisation and speech emotion recognition: A perspective on representation learning and sequence modelling

Speech carries a large amount of useful information: it not only constitutes one of the main mechanisms of human-to-human communication, but also provides an indispensable modality for human-computer interaction. To enable processing and information retrieval from speech, computational speech processing systems convert speech sound waves into a one-dimensional discrete time series, i.e., a digital speech recording. However, the quality of these recordings is degraded by undesirable artefacts such as reverberation, background noise, and distortions caused by the non-ideal response and limited numerical precision of the recording device. An effective speech information retrieval system therefore needs to identify and interpret the meaningful temporal content in the recording while remaining robust to interference from artefacts and irrelevant components.

Present-day data-driven models based on deep neural networks have surpassed average human performance on a variety of perceptual tasks, and provide powerful, practical tools for modern speech/audio processing, including speech information retrieval. In this thesis, we propose to use deep neural network models first to extract features that capture high-level speech representations reflecting the intrinsic structure of the data, and then to model the temporal relationships among these features with a sequence model. We apply this modelling paradigm to two speech/audio processing tasks: binaural sound source localisation and speech emotion recognition.
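As an illustration of this two-stage paradigm, the minimal sketch below pairs a generic feed-forward representation encoder with a recurrent sequence model. It assumes PyTorch; the layer types, dimensions and the GRU sequence model are illustrative choices only and not the architectures developed in the thesis.

    # Minimal sketch (PyTorch) of the two-stage paradigm: a representation
    # encoder maps each frame to a high-level feature vector, and a sequence
    # model captures the temporal relationships among those features.
    # All layer choices and sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class RepresentationEncoder(nn.Module):
        def __init__(self, in_dim=40, feat_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim), nn.ReLU())

        def forward(self, x):                 # x: (batch, time, in_dim)
            return self.net(x)

    class SequenceModel(nn.Module):
        def __init__(self, feat_dim=128, n_outputs=4):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
            self.head = nn.Linear(feat_dim, n_outputs)

        def forward(self, feats):
            out, _ = self.rnn(feats)
            return self.head(out[:, -1])      # predict from the final time step

    frames = torch.randn(8, 300, 40)          # e.g. 300 frames of 40-dim features
    prediction = SequenceModel()(RepresentationEncoder()(frames))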

For these two tasks, binaural sound source localisation and cross-language/cross-corpus speech emotion recognition, we design distinct models that learn representations reflecting the intrinsic structure of the acquired data relevant to the envisaged task. For binaural sound source localisation, we propose a parametric embedding that defines a similarity metric in a latent space using a deep neural network architecture known as the “siamese” network. The model is optimised to map points that are close to each other in the latent space (the space of source azimuths and elevations) to nearby points in the embedding space, so that the Euclidean distances between embeddings reflect the proximity of the corresponding sources and the embeddings form a manifold, which makes them interpretable. We show that the proposed embedding generalises well to various acoustic conditions (with reverberation) different from those encountered during training, and outperforms unsupervised embeddings previously used for binaural sound source localisation. We also extend this embedding to both supervised and weakly supervised learning, and show that the resulting embeddings perform similarly well in both settings, while the weakly supervised embedding allows source azimuth and elevation to be estimated simultaneously.
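The following is a minimal sketch, assuming a PyTorch implementation, of the kind of siamese metric learning described above: a weight-sharing encoder produces embeddings for pairs of binaural feature vectors, and a loss encourages the Euclidean distance between the two embeddings to match the (normalised) distance between their source directions. The layer sizes, input features and loss are illustrative assumptions rather than the configuration used in the thesis.

    # Minimal sketch (PyTorch) of a siamese embedding whose Euclidean distances
    # are trained to reflect the proximity of source directions. Layer sizes,
    # the input feature choice and the loss are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SiameseEncoder(nn.Module):
        def __init__(self, in_dim=512, emb_dim=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
                nn.Linear(128, emb_dim),
            )

        def forward(self, x):
            return self.net(x)

    def pairwise_metric_loss(emb_a, emb_b, target_dist):
        # Match the embedding distance of a pair of binaural feature vectors
        # to the (normalised) angular distance between their source directions.
        emb_dist = torch.norm(emb_a - emb_b, dim=1)
        return torch.mean((emb_dist - target_dist) ** 2)

    encoder = SiameseEncoder()
    feats_a = torch.randn(16, 512)   # binaural features, condition A
    feats_b = torch.randn(16, 512)   # binaural features, condition B
    target = torch.rand(16)          # normalised angular distance between the sources
    loss = pairwise_metric_loss(encoder(feats_a), encoder(feats_b), target)
    loss.backward()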

For the cross-language speech emotion recognition task, we aim to mitigate the performance degradation that models suffer under cross-language and cross-corpus conditions, and propose a transfer learning method based on a pre-trained wav2vec 2.0 model. This model maps time-domain audio waveforms into an embedding space shared across 53 different languages, and it is trained such that contextual information is preserved while the influence of language variability is marginalised out. We then propose a Deep Within-Class Covariance Normalisation (Deep-WCCN) layer that can be inserted into the neural network to further reduce its susceptibility to other sources of variability, such as speaker and channel variability. Experimental results show that the proposed method outperforms a baseline built on common acoustic feature sets for speech emotion recognition in the within-language setting, and outperforms both this baseline and state-of-the-art models in the cross-language setting. In addition, we experimentally validate the effectiveness of Deep-WCCN, which further improves model performance. Finally, we show that the proposed transfer learning method exhibits good data efficiency when target-language data are merged into the fine-tuning process.
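A minimal sketch of this pipeline is given below. It assumes the Hugging Face transformers implementation of the multilingual wav2vec 2.0 (XLSR-53) checkpoint and a simplified, batch-level within-class covariance normalisation; the pooling strategy and the way the normalisation statistics are estimated here are illustrative assumptions, not the Deep-WCCN layer as implemented in the thesis.

    # Minimal sketch (PyTorch / Hugging Face transformers) of extracting
    # cross-lingual wav2vec 2.0 embeddings and applying a simplified
    # within-class covariance normalisation. Checkpoint name, mean pooling and
    # the batch-level covariance estimate are illustrative assumptions.
    import torch
    from transformers import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
    model.eval()

    waveforms = torch.randn(4, 16000)                  # 4 one-second clips at 16 kHz
    with torch.no_grad():
        frames = model(waveforms).last_hidden_state    # (batch, time, hidden_dim)
    utt_emb = frames.mean(dim=1)                       # mean-pool over time

    def wccn(emb, labels, eps=1e-3):
        # Whiten by the within-class covariance so that nuisance variability
        # (e.g. speaker, channel) shared within a class is suppressed.
        dim = emb.size(1)
        cov = torch.zeros(dim, dim)
        for c in labels.unique():
            x = emb[labels == c]
            x = x - x.mean(dim=0, keepdim=True)
            cov += x.t() @ x / max(len(x), 1)
        cov /= len(labels.unique())
        b = torch.linalg.cholesky(torch.linalg.inv(cov + eps * torch.eye(dim)))
        return emb @ b

    labels = torch.tensor([0, 0, 1, 1])                # e.g. speaker identities
    normalised = wccn(utt_emb, labels)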

We also address the problem of modelling temporal dependencies in long speech/audio sequences, especially for end-to-end learning, and propose a novel end-to-end deep neural network model for speech emotion recognition. The model is based on dilated causal convolution with context stacking; it is parallelisable and has a receptive field as large as the input sequence, while maintaining a reasonably low computational cost. We evaluate the model on speech emotion recognition regression and classification tasks, and show that it improves recognition performance over a state-of-the-art end-to-end model. Moreover, we study the impact of different input representations, such as raw audio samples versus log mel-spectrograms, and illustrate the benefits of an end-to-end approach over hand-crafted audio features.
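The sketch below illustrates the core building block, a stack of dilated causal 1-D convolutions in which the dilation doubles per layer, so the receptive field grows exponentially with depth. It assumes PyTorch; the channel counts, depth and residual connections are illustrative, and the context-stacking mechanism of the proposed model is not reproduced here.

    # Minimal sketch (PyTorch) of a stack of dilated causal 1-D convolutions.
    # Doubling the dilation per layer grows the receptive field exponentially
    # with depth; channel counts and depth are illustrative assumptions.
    import torch
    import torch.nn as nn

    class DilatedCausalConv(nn.Module):
        def __init__(self, channels, kernel_size, dilation):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation    # left padding only -> causal
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x):
            x = nn.functional.pad(x, (self.pad, 0))
            return torch.relu(self.conv(x))

    class DilatedStack(nn.Module):
        def __init__(self, channels=64, kernel_size=2, n_layers=10):
            super().__init__()
            self.layers = nn.ModuleList(
                [DilatedCausalConv(channels, kernel_size, 2 ** i) for i in range(n_layers)]
            )

        def forward(self, x):                          # x: (batch, channels, time)
            for layer in self.layers:
                x = x + layer(x)                       # residual connection
            return x

    model = DilatedStack()
    frames = torch.randn(8, 64, 1024)                  # e.g. features projected to 64 channels
    out = model(frames)                                # receptive field: 2**10 = 1024 frames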

Date: 2 Aug 2017 → 7 Oct 2022
Keywords: User-centered, Deep Learning, Emotion Detection, Mood Disorder Prediction
Disciplines: Applied mathematics in specific fields, Modelling, Multimedia processing
Project type: PhD project