< Back to previous page

Publication

Learning Salient Segments for Speech Emotion Recognition Using Attentive Temporal Pooling

Journal Contribution - Journal Article

In the temporal process of expressing the emotions, some intervals embed more salient emotion information than others. In this paper, by introducing an attentive temporal pooling module into the deep neural network (DNN) architecture, we present a simple but effective speech emotion recognition (SER) framework, which is able to automatically highlight the emotionally salient segments while suppressing the influence of less relevant ones. For an input speech utterance, the extracted feature sequence of hand-crafted low-level descriptors (LLDs) are evenly split into several overlapping temporal segments, and the segment-level features are computed by performing functionals on the LLDs of each segment. These segment-level features are then input into a DNN model outputting the emotion probabilities as well as the more condensed representation of each segment. An attentive temporal pooling module, consisting of an auxiliary DNN and a Gaussian Mixture Model (GMM), is proposed to learn the emotional saliency weights of different temporal segments from the condensed representations, which are then assigned to the segment-level emotion probabilities for the final utterance-level prediction. Notably, the attentive temporal pooling module and the DNN architecture for feature abstraction can be jointly trained using only the utterance-level labels, while without any frame-level or segment-level supervisory information. Experimental results on the three public released emotion datasets RML, EMO-DB, and IEMOCAP show that the proposed framework obtains state-of-the-art performance on SER.

Journal: IEEE Access
ISSN: 2169-3536
Volume: 8
Pages: 151740-151752
Publication year:2020
  • ORCID: /0000-0002-1774-2970/work/83442772
  • Scopus Id: 85090763801