Speech Emotion Recognition Using Scalogram Based Deep Structure
Author(s) -
Khadijeh Aghajani,
Iman Esmaili Paeen Afrakoti
Publication year - 2020
Publication title -
International Journal of Engineering, Transactions B: Applications
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.213
H-Index - 17
ISSN - 1728-144X
DOI - 10.5829/ije.2020.33.02b.13
Subject(s) - computer science , recurrent neural network , speech recognition , salient , artificial intelligence , convolutional neural network , classifier (uml) , pattern recognition (psychology) , emotion recognition , feature extraction , artificial neural network
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on extracting handcrafted features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking style, and environment. Here, an SER method is proposed based on a Convolutional Neural Network (CNN) concatenated with a Recurrent Neural Network (RNN). CNNs can learn local salient features from speech signals, images, and videos, while RNNs are used in many sequential data processing tasks to learn long-term dependencies between local features. Combining the two exploits the strengths of both networks. In the proposed method, the CNN is applied directly to a scalogram of the speech signal; an attention-mechanism-based RNN model then learns long-term temporal relationships among the learned features. Experiments on the RAVDESS, SAVEE, and Emo-DB datasets demonstrate the effectiveness of the proposed SER method.
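The scalogram mentioned in the abstract is the magnitude of a continuous wavelet transform, which turns a 1-D speech waveform into a 2-D time-scale "image" suitable as CNN input. The paper does not specify its wavelet or scale grid, so the sketch below is only illustrative: it builds a Morlet-style scalogram of a synthetic chirp with plain NumPy (the `morlet` helper, the `w0` parameter, and the scale range are all assumptions, not the authors' settings).

```python
import numpy as np

def morlet(scale, n, w0=5.0):
    # Complex Morlet-like wavelet sampled at n points for a given scale
    # (illustrative choice; the paper does not specify its wavelet).
    t = (np.arange(n) - n // 2) / scale
    return np.exp(1j * w0 * t) * np.exp(-t**2 / 2) / np.sqrt(scale)

def scalogram(signal, scales, w0=5.0):
    # |CWT| of the signal: one row per scale, one column per time sample.
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        n = min(len(signal), int(10 * s))       # truncate wavelet support
        w = morlet(s, n, w0)
        out[i] = np.abs(np.convolve(signal, w, mode="same"))
    return out

# Synthetic "speech-like" chirp standing in for a real utterance.
fs = 8000                                        # assumed sample rate
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * (200 + 400 * t) * t)

scales = np.geomspace(4, 64, num=32)             # assumed log-spaced scale grid
S = scalogram(x, scales)
print(S.shape)                                   # 2-D array fed to the CNN
```

In the proposed pipeline this 2-D array would be the CNN's input; the CNN's frame-wise feature maps would then be passed along the time axis to the attention-based RNN.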