
Publication

A multi-scale multi-attention network for dynamic facial expression recognition

Journal Contribution - Journal Article

Characterizing spatial information and modelling temporal dynamics of facial images are key challenges for dynamic facial expression recognition (FER). In this paper, we propose an end-to-end multi-scale multi-attention network (MSMA-Net) for dynamic FER. In our model, spatio-temporal features are encoded at two scales: the entire face and local facial patches. At each scale, a 2D convolutional neural network (CNN) captures frame-based spatial information, and a 3D CNN models the short-term dynamics in the temporal sequence. Moreover, we propose a multi-attention mechanism comprising both temporal and spatial attention models. Temporal attention is applied to the image sequence to highlight expressive frames within the whole sequence, and spatial attention is applied at the patch level to learn salient facial features. Comprehensive experiments on publicly available datasets (Aff-Wild2, RML, and AFEW) show that the proposed MSMA-Net automatically highlights salient expressive frames, within which salient facial features are learned, yielding better or highly competitive results compared to state-of-the-art methods.
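
The abstract describes the architecture only at a high level. As a reading aid, the following minimal PyTorch sketch illustrates the two-scale encoding (whole face and local patches, each combining a per-frame 2D CNN with a clip-level 3D CNN) and the two attention mechanisms (temporal attention over frames, spatial attention over patches). All module names (GlobalStream, PatchStream, MSMASketch), layer sizes, the patch grid, and the fusion-by-concatenation step are assumptions made for illustration; they are not taken from the paper.

import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Scores each frame and reweights the sequence toward expressive frames."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, D) per-frame features
        w = torch.softmax(self.score(x), dim=1)  # (B, T, 1) frame weights
        return (w * x).sum(dim=1)                # attended clip feature: (B, D)


class GlobalStream(nn.Module):
    """Whole-face scale: 2D CNN per frame plus 3D CNN over the short clip.
    Layer choices here are illustrative, not the paper's backbones."""
    def __init__(self, dim=64):
        super().__init__()
        self.cnn2d = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.temporal_att = TemporalAttention(dim)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        frames = clip.flatten(0, 1)              # (B*T, 3, H, W)
        spatial = self.cnn2d(frames).flatten(1).view(B, T, -1)
        spatial = self.temporal_att(spatial)     # highlight expressive frames
        motion = self.cnn3d(clip.transpose(1, 2)).flatten(1)  # short-term dynamics
        return torch.cat([spatial, motion], dim=1)            # (B, 2*dim)


class PatchStream(nn.Module):
    """Local scale: encode a grid of facial patches, attend over patches."""
    def __init__(self, dim=64, grid=2):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.score = nn.Linear(dim, 1)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        B, T, C, H, W = clip.shape
        g = self.grid
        # Cut every frame into a g x g grid of non-overlapping patches.
        p = clip.unfold(3, H // g, H // g).unfold(4, W // g, W // g)
        p = p.permute(0, 1, 3, 4, 2, 5, 6).reshape(B * T * g * g, C, H // g, W // g)
        feats = self.encoder(p).flatten(1).view(B, T, g * g, -1)
        w = torch.softmax(self.score(feats), dim=2)  # spatial attention: salient patches
        return (w * feats).sum(dim=2).mean(dim=1)    # pool patches, then frames: (B, dim)


class MSMASketch(nn.Module):
    """Fuse both scales by concatenation and classify the expression."""
    def __init__(self, dim=64, n_classes=7):
        super().__init__()
        self.global_stream = GlobalStream(dim)
        self.patch_stream = PatchStream(dim)
        self.head = nn.Linear(3 * dim, n_classes)

    def forward(self, clip):
        fused = torch.cat([self.global_stream(clip), self.patch_stream(clip)], dim=1)
        return self.head(fused)


clip = torch.randn(2, 8, 3, 64, 64)   # 2 clips, 8 frames, 64x64 RGB faces
print(MSMASketch()(clip).shape)       # torch.Size([2, 7])

In this sketch both attention modules reduce to a learned softmax weighting; the temporal one pools frames after the 2D encoder, the spatial one pools patch features, matching the abstract's description of highlighting expressive frames and salient facial regions.
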
Journal: Multimedia Systems
ISSN: 1432-1882
Issue: 2
Volume: 28
Pages: 479-493
Publication year: 2022
BOF-keylabel: yes
IOF-keylabel: yes
BOF-publication weight: 1
Authors: International
Authors from: Government, Higher Education
Accessibility: Closed