
A Review of Audio-Visual Fusion with Machine Learning
Author(s) -
Xiaoyu Song,
Hong Chen,
Qing Wang,
Yunqiang Chen,
Mengxiao Tian,
Hui Tang
Publication year - 2019
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1237/2/022144
Subject(s) - computer science , modal , modalities , speech recognition , audio visual , biometrics , artificial intelligence , machine learning , pattern recognition (psychology) , multimedia , social science , chemistry , sociology , polymer chemistry
In single-modal recognition, research on speech signals, ECG signals, facial expressions, body postures, and other physiological signals has made some progress. However, the diversity of information sources in the human brain and the uncertainty inherent in any single modality limit the accuracy of single-modal recognition. Building a multimodal recognition framework that combines several modalities has therefore become an effective means of improving performance. With the rise of multimodal machine learning, multimodal information fusion has become a research hotspot, and audio-visual fusion is its most widely applied direction. Audio-visual fusion methods have been successfully applied to a variety of problems, such as emotion recognition, multimedia event detection, biometrics, and speech recognition. This paper first gives a brief introduction to multimodal machine learning, then summarizes the development and current state of audio-visual fusion technology in several major areas, and finally offers an outlook on future work.
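To make the fusion idea concrete, the sketch below contrasts the two strategies most commonly discussed in audio-visual fusion work: feature-level (early) fusion, which concatenates audio and visual descriptors before classification, and decision-level (late) fusion, which combines per-modality predictions. The feature dimensions, the toy linear "models", and the number of emotion classes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 4                      # e.g. four emotion categories (assumption)
AUDIO_DIM, VISUAL_DIM = 40, 128    # e.g. MFCC statistics vs. a face-embedding size (assumption)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy per-modality and joint linear classifiers standing in for trained models.
W_audio  = rng.normal(size=(AUDIO_DIM,  N_CLASSES))
W_visual = rng.normal(size=(VISUAL_DIM, N_CLASSES))
W_fused  = rng.normal(size=(AUDIO_DIM + VISUAL_DIM, N_CLASSES))

def early_fusion(audio_feat, visual_feat):
    """Feature-level fusion: concatenate modalities, then classify jointly."""
    fused = np.concatenate([audio_feat, visual_feat], axis=-1)
    return softmax(fused @ W_fused)

def late_fusion(audio_feat, visual_feat, w=0.5):
    """Decision-level fusion: classify each modality, then average the scores."""
    p_audio  = softmax(audio_feat  @ W_audio)
    p_visual = softmax(visual_feat @ W_visual)
    return w * p_audio + (1.0 - w) * p_visual

audio_feat  = rng.normal(size=AUDIO_DIM)    # stand-in for an audio descriptor
visual_feat = rng.normal(size=VISUAL_DIM)   # stand-in for a visual descriptor

print("early fusion posterior:", early_fusion(audio_feat, visual_feat))
print("late fusion posterior: ", late_fusion(audio_feat, visual_feat))
```

The weighting `w` in the late-fusion variant is a simple stand-in for the modality-reliability weighting that fusion systems often learn or tune; in practice both the per-modality models and the combination rule would be trained on labeled audio-visual data.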