Open Access
Multimodal Fusion of Speech and Gesture Recognition based on Deep Learning
Author(s) -
Xiaoyu Qiu,
Zhiquan Feng,
Xiaohui Yang,
Jinglan Tian
Publication year - 2020
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1453/1/012092
Subject(s) - gesture, computer science, speech recognition, gesture recognition, similarity (geometry), fusion, mode (computer interface), artificial intelligence, architecture, deep learning, pattern recognition (psychology), human–computer interaction, image (mathematics), linguistics, art, philosophy, visual arts
This paper proposes a multimodal fusion architecture based on deep learning. The architecture accepts two input forms: speech commands and hand gestures. First, the user's inputs are recognized separately, with a CNN for speech command recognition and an LSTM for hand gesture recognition. Second, the recognition outputs are matched against keywords and compared by similarity degree to obtain candidate results. Finally, the two results are fused to output the final instruction. Experiments show that the proposed multimodal fusion model outperforms either single-modality model on its own.
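The final fusion step described in the abstract can be illustrated with a decision-level (late) fusion sketch: each recognizer emits a probability distribution over a shared command vocabulary, and a weighted combination selects the final instruction. The command names, weighting scheme, and function names below are illustrative assumptions, not the paper's exact method.

```python
# Illustrative late-fusion sketch (assumed scheme, not the paper's exact method).
# Each recognizer (CNN for speech, LSTM for gesture) is assumed to output a
# probability distribution over the same command vocabulary.

COMMANDS = ["open", "close", "zoom_in", "zoom_out"]  # hypothetical vocabulary

def fuse(speech_probs, gesture_probs, w_speech=0.5):
    """Weighted late fusion of two per-command probability lists.

    Returns the highest-scoring command and its fused score.
    """
    assert len(speech_probs) == len(gesture_probs) == len(COMMANDS)
    fused = [w_speech * s + (1.0 - w_speech) * g
             for s, g in zip(speech_probs, gesture_probs)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return COMMANDS[best], fused[best]

# Example: speech strongly favors "zoom_in"; gesture agrees, less confidently.
speech = [0.05, 0.05, 0.80, 0.10]
gesture = [0.10, 0.10, 0.60, 0.20]
cmd, score = fuse(speech, gesture)
```

When the two modalities agree, fusion reinforces the shared command; when they disagree, the weight `w_speech` decides which modality dominates, which is one simple way a multimodal decision can beat either single-modality recognizer alone.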
