
Learning deep features to recognise speech emotion using merged deep CNN
Author(s) - Zhao Jianfeng, Mao Xia, Chen Lijiang
Publication year - 2018
Publication title - IET Signal Processing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.384
H-Index - 42
ISSN - 1751-9683
DOI - 10.1049/iet-spr.2017.0320
Subject(s) - computer science, convolutional neural network, deep learning, artificial intelligence, benchmark, pattern recognition (psychology), transfer of learning, speech recognition, hyperparameter
This study aims at learning deep features from different data to recognise speech emotion. The authors designed a merged convolutional neural network (CNN) with two branches, one a one-dimensional (1D) CNN branch and the other a 2D CNN branch, to learn high-level features from raw audio clips and log-mel spectrograms. The merged deep CNN was built in two steps. First, a 1D CNN architecture and a 2D CNN architecture were designed and evaluated; then, after deleting the second dense layer of each, the two architectures were merged. To speed up training of the merged CNN, transfer learning was introduced: the 1D CNN and the 2D CNN were trained first, and their learned features were repurposed and transferred to the merged CNN. Finally, the merged deep CNN, initialised with the transferred features, was fine-tuned. Two hyperparameters of the designed architectures were chosen through Bayesian optimisation during training. Experiments conducted on two benchmark datasets show that the merged deep CNN improves emotion classification performance significantly.
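To make the two-step construction concrete, the following is a minimal sketch of such a merged two-branch CNN, assuming Keras/TensorFlow. All input shapes, layer sizes, and layer names here are illustrative assumptions, not the authors' published configuration; the sketch only shows the pattern of training two branches, dropping their second dense layers, and merging the remaining feature layers. Because the branch layers are shared by construction, the merged model starts from the branch weights learned in step 1, which is one simple way to realise the transfer described in the abstract.

```python
# Sketch of a merged 1D/2D CNN for speech emotion recognition.
# Assumes Keras/TensorFlow; all shapes and sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 7  # assumed number of emotion classes


def build_1d_branch(input_len=48000):
    """1D CNN branch over raw audio samples (illustrative shapes)."""
    inp = layers.Input(shape=(input_len, 1), name="raw_audio")
    x = layers.Conv1D(64, 9, strides=4, activation="relu")(inp)
    x = layers.MaxPooling1D(4)(x)
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    # First dense layer; this is what survives into the merged model.
    x = layers.Dense(256, activation="relu", name="dense1_1d")(x)
    return inp, x


def build_2d_branch(mel_shape=(128, 128, 1)):
    """2D CNN branch over log-mel spectrograms (illustrative shapes)."""
    inp = layers.Input(shape=mel_shape, name="log_mel")
    x = layers.Conv2D(32, (3, 3), activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation="relu")(x)
    x = layers.GlobalMaxPooling2D()(x)
    x = layers.Dense(256, activation="relu", name="dense1_2d")(x)
    return inp, x


# Step 1: build each branch with its own second dense layer and softmax
# head, and train the two models separately on the emotion corpus.
inp1, feat1 = build_1d_branch()
head1 = layers.Dense(NUM_EMOTIONS, activation="softmax")(
    layers.Dense(128, activation="relu", name="dense2_1d")(feat1))
model_1d = models.Model(inp1, head1)

inp2, feat2 = build_2d_branch()
head2 = layers.Dense(NUM_EMOTIONS, activation="softmax")(
    layers.Dense(128, activation="relu", name="dense2_2d")(feat2))
model_2d = models.Model(inp2, head2)
# ... model_1d.fit(...) and model_2d.fit(...) go here ...

# Step 2: discard the second dense layers, concatenate the branch
# features, and attach a shared classifier to form the merged CNN.
merged = layers.concatenate([feat1, feat2])
out = layers.Dense(NUM_EMOTIONS, activation="softmax")(merged)
merged_model = models.Model([inp1, inp2], out)

# The branch layers are shared tensors, so the merged model already
# holds the weights learned in step 1; fine-tune end to end with a
# small learning rate.
merged_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])
# merged_model.fit([raw_batch, mel_batch], labels, ...)
```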
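The abstract does not say which two hyperparameters were tuned or which optimisation library was used, so the following is only a hedged sketch of the Bayesian-optimisation step, assuming scikit-optimize and using learning rate and batch size as stand-in hyperparameters. The helper `train_and_validate` is hypothetical and is stubbed out so the snippet runs standalone.

```python
# Sketch of choosing two hyperparameters via Bayesian optimisation,
# assuming scikit-optimize; the tuned hyperparameters shown here are
# illustrative, not those named by the authors.
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(16, 128, name="batch_size"),
]


def train_and_validate(lr, batch_size):
    """Hypothetical helper: would fine-tune the merged CNN with these
    settings and return its validation error. Replaced by a dummy
    value here so the sketch executes on its own."""
    return 0.5


def objective(params):
    lr, batch_size = params
    return train_and_validate(lr=lr, batch_size=int(batch_size))


# Gaussian-process Bayesian optimisation over the two hyperparameters.
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best hyperparameters:", result.x, "validation error:", result.fun)
```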