Speech Emotion Recognition Using Deep Feedforward Neural Network

Open Access

Author(s) - Muhammad Fahreza Alghifari, Teddy Surya Gunawan, Mira Kartiwi

Publication year - 2018

Publication title - Indonesian Journal of Electrical Engineering and Computer Science

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.241

H-Index - 17

eISSN - 2502-4760

pISSN - 2502-4752

DOI - 10.11591/ijeecs.v10.i2.pp554-561

Subject(s) - mel frequency cepstrum, speech recognition, computer science, artificial neural network, emotion recognition, set (abstract data type), artificial intelligence, feature extraction, pattern recognition (psychology), programming language

Speech emotion recognition (SER) is currently an active research area because it is challenging yet has promising future applications. The objective of this research is to use Deep Neural Networks (DNNs) to recognize emotion in human speech. First, the chosen speech features, Mel-frequency cepstral coefficients (MFCCs), were extracted from raw audio data. Second, the extracted features were fed into the DNN to train the network. The trained network was then tested on a set of labelled emotional speech audio, and the recognition rate was evaluated. Based on the resulting accuracy, the number of MFCCs, neurons, and layers was adjusted to optimize the network. In addition, a custom-made database was introduced and validated using the optimized network. The optimum configuration for SER is 13 MFCCs, 12 neurons, and 2 layers for 3 emotions, and 25 MFCCs, 21 neurons, and 4 layers for 4 emotions, achieving a total recognition rate of 96.3% for 3 emotions and 97.1% for 4 emotions.
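The train-then-test procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It makes several assumptions: synthetic Gaussian clusters stand in for real MFCC vectors (no audio is read), "2 layers" is interpreted as two hidden layers of 12 neurons each, and the tanh activation, learning rate, epoch count, and all function names are illustrative choices that do not come from the paper.

```python
import numpy as np

def make_synthetic_features(n_per_class, n_features, n_classes, rng):
    """Hypothetical stand-in for MFCC vectors: one Gaussian cluster per emotion."""
    centres = rng.normal(0.0, 3.0, (n_classes, n_features))
    x = np.vstack([c + rng.normal(0.0, 1.0, (n_per_class, n_features)) for c in centres])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return x, y

def init_network(sizes, rng):
    """Create one (weights, bias) pair per pair of consecutive layer sizes."""
    return [(rng.normal(0.0, np.sqrt(1.0 / n_in), (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return activations of every layer; the last entry holds class probabilities."""
    acts = [x]
    for i, (w, b) in enumerate(params):
        z = acts[-1] @ w + b
        if i < len(params) - 1:
            acts.append(np.tanh(z))                       # hidden layer
        else:
            e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable softmax
            acts.append(e / e.sum(axis=1, keepdims=True))
    return acts

def train(params, x, y, n_classes, lr=0.5, epochs=500):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        acts = forward(params, x)
        delta = (acts[-1] - onehot) / len(x)              # gradient w.r.t. output logits
        for i in reversed(range(len(params))):
            w, b = params[i]
            grad_w = acts[i].T @ delta
            grad_b = delta.sum(axis=0)
            if i > 0:
                # Back-propagate through the tanh of the previous hidden layer.
                delta = (delta @ w.T) * (1.0 - acts[i] ** 2)
            params[i] = (w - lr * grad_w, b - lr * grad_b)
    return params

rng = np.random.default_rng(0)
# Assumed setup mirroring the 3-emotion configuration: 13 features, 3 classes.
x, y = make_synthetic_features(100, 13, 3, rng)
order = rng.permutation(len(x))
x, y = x[order], y[order]
split = int(0.8 * len(x))                                 # 80% train, 20% test

# Standardize with training-set statistics only.
mean, std = x[:split].mean(axis=0), x[:split].std(axis=0)
x_train, x_test = (x[:split] - mean) / std, (x[split:] - mean) / std

params = train(init_network([13, 12, 12, 3], rng), x_train, y[:split], n_classes=3)
probs = forward(params, x_test)[-1]
accuracy = float((probs.argmax(axis=1) == y[split:]).mean())
print(f"held-out recognition rate on synthetic data: {accuracy:.3f}")
```

The printed figure reflects only how separable the synthetic clusters are and says nothing about the recognition rates reported in the abstract. To explore other configurations, such as the 4-emotion one, change the layer-size list passed to init_network along with the feature and class counts.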
