TY - GEN
T1 - Deep Learning for Audio Visual Emotion Recognition
AU - Hussain, T.
AU - Wang, W.
AU - Bouaynaya, N.
AU - Fathallah-Shaykh, H.
AU - Mihaylova, L.
N1 - Funding Information:
Acknowledgements. We are grateful to the EPSRC for funding this work through project EP/T013265/1, NSF-EPSRC: “ShiRAS. Towards Safe and Reliable Autonomy in Sensor Driven Systems”, and to the National Science Foundation (USA) for its support of ShiRAS under Grant NSF ECCS 1903466. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Publisher Copyright:
© 2022 International Society of Information Fusion.
PY - 2022
Y1 - 2022
N2 - Human emotions can be expressed through multiple data modalities, e.g. video, audio and text. An automated emotion recognition system needs to address a number of challenging issues, including feature extraction and dealing with variations and noise in the data. Deep learning has been used extensively in recent years, offering excellent performance in emotion recognition. This work presents a new method based on audio and visual modalities, where visual cues facilitate the detection of speech and non-speech frames and of the speaker's emotional state. Different from previous works, we propose the use of novel speech features, e.g. the Wavegram, which is extracted with a one-dimensional Convolutional Neural Network (CNN) learned directly from time-domain waveforms, and the Wavegram-Logmel feature, which combines the Wavegram with the log mel spectrogram. The system is then trained in an end-to-end fashion on the SAVEE database, also taking advantage of the correlations among the streams. It is shown that the proposed approach outperforms traditional and state-of-the-art deep learning approaches built separately on handcrafted auditory and visual features for the prediction of spontaneous and natural emotions.
AB - Human emotions can be expressed through multiple data modalities, e.g. video, audio and text. An automated emotion recognition system needs to address a number of challenging issues, including feature extraction and dealing with variations and noise in the data. Deep learning has been used extensively in recent years, offering excellent performance in emotion recognition. This work presents a new method based on audio and visual modalities, where visual cues facilitate the detection of speech and non-speech frames and of the speaker's emotional state. Different from previous works, we propose the use of novel speech features, e.g. the Wavegram, which is extracted with a one-dimensional Convolutional Neural Network (CNN) learned directly from time-domain waveforms, and the Wavegram-Logmel feature, which combines the Wavegram with the log mel spectrogram. The system is then trained in an end-to-end fashion on the SAVEE database, also taking advantage of the correlations among the streams. It is shown that the proposed approach outperforms traditional and state-of-the-art deep learning approaches built separately on handcrafted auditory and visual features for the prediction of spontaneous and natural emotions.
UR - http://www.scopus.com/inward/record.url?scp=85136586642&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136586642&partnerID=8YFLogxK
U2 - 10.23919/FUSION49751.2022.9841342
DO - 10.23919/FUSION49751.2022.9841342
M3 - Conference contribution
AN - SCOPUS:85136586642
T3 - 2022 25th International Conference on Information Fusion, FUSION 2022
BT - 2022 25th International Conference on Information Fusion, FUSION 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th International Conference on Information Fusion, FUSION 2022
Y2 - 4 July 2022 through 7 July 2022
ER -
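
To illustrate the Wavegram-Logmel idea described in the abstract, the following minimal PyTorch sketch combines a 1-D CNN branch learned on the raw waveform with a log mel spectrogram branch. The layer sizes, sample rate, module name WavegramLogmelFrontEnd and the fusion by simple time-aligned concatenation are illustrative assumptions for demonstration, not the authors' implementation.

# Illustrative sketch (not the paper's code): a Wavegram-style 1-D CNN branch over
# the raw waveform fused with a log mel spectrogram branch. All hyperparameters
# (channel counts, strides, n_fft, hop length) are assumptions for demonstration.
import torch
import torch.nn as nn
import torchaudio

class WavegramLogmelFrontEnd(nn.Module):
    def __init__(self, sample_rate=16000, n_mels=64):
        super().__init__()
        # "Wavegram" branch: a 1-D CNN applied directly to the time-domain waveform.
        self.wave_cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=11, stride=5, padding=5),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
        )
        # Log mel spectrogram branch.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=320, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()

    def forward(self, waveform):
        # waveform: (batch, samples)
        wavegram = self.wave_cnn(waveform.unsqueeze(1))   # (batch, 64, frames_w)
        logmel = self.to_db(self.melspec(waveform))       # (batch, n_mels, frames_m)
        # Crudely align the two time axes before concatenating along the channel axis.
        frames = min(wavegram.shape[-1], logmel.shape[-1])
        fused = torch.cat([wavegram[..., :frames], logmel[..., :frames]], dim=1)
        return fused                                      # (batch, 64 + n_mels, frames)

if __name__ == "__main__":
    x = torch.randn(2, 16000)                  # two 1-second clips at 16 kHz
    print(WavegramLogmelFrontEnd()(x).shape)   # e.g. torch.Size([2, 128, 51])

The fused feature map could then feed a 2-D CNN classifier and, in an audio-visual setting such as the one the abstract describes, be combined with a visual stream before the final emotion prediction; that fusion stage is beyond this sketch.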