
Attention-based fusion network for human eye-fixation prediction in 3D images
Author(s) - Yiliang Lv, Wujie Zhou, Jingsheng Lei, Ye Lv, Ting Luo
Publication year - 2019
Publication title - Optics Express
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.394
H-Index - 271
ISSN - 1094-4087
DOI - 10.1364/oe.27.034056
Subject(s) - computer science , artificial intelligence , rgb color model , pattern recognition (psychology) , segmentation , margin (machine learning) , convolutional neural network , focus (optics) , feature extraction , computer vision , feature (linguistics) , machine learning , optics , linguistics , philosophy , physics
Human eye-fixation prediction in 3D images is important for many 3D applications, such as fine-grained 3D video object segmentation and intelligent bulletproof curtains. The vast majority of existing approaches are 2D-based and cannot be applied directly; the main challenge lies in the inconsistency, or even conflict, between the RGB and depth saliency maps. In this paper, we propose a three-stream architecture that accurately predicts human visual attention on 3D images end to end. First, a two-stream feature-extraction network built on convolutional neural networks is trained for the RGB and depth modalities, and hierarchical features are extracted from each ResNet-18 backbone. These multi-level features are then fed into a channel attention mechanism that suppresses inconsistency across the feature space and makes the network focus on salient targets. The enhanced features are fused step by step by a VGG-16 network to generate the coarse saliency map. Finally, each coarse map is refined through refinement blocks, which correct the network's own prediction errors based on the knowledge it has acquired, thereby converting the predicted saliency map from coarse to fine. Comparisons with six state-of-the-art approaches on the NUS dataset (CC of 0.5579, KLDiv of 1.0903, AUC of 0.8339, and NSS of 2.3373) and the NCTU dataset (CC of 0.8614, KLDiv of 0.2681, AUC of 0.9143, and NSS of 2.3795) indicate that the proposed model consistently outperforms them by a considerable margin, owing to its full use of the channel attention mechanism.
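To make the fusion step concrete, the following is a minimal sketch (not the authors' released code) of a squeeze-and-excitation style channel attention block in PyTorch, written under the assumption that the mechanism reweights concatenated RGB and depth features at one level of the hierarchy; the class name, channel counts, and reduction ratio are illustrative assumptions rather than values taken from the paper.

    # Minimal sketch of an SE-style channel attention fusion block (assumed design).
    import torch
    import torch.nn as nn

    class ChannelAttentionFusion(nn.Module):
        """Reweight channels of concatenated RGB/depth features (hypothetical layer sizes)."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial average
            self.fc = nn.Sequential(                        # excitation: per-channel weights
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
            x = torch.cat([rgb_feat, depth_feat], dim=1)    # (B, 2C, H, W)
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                    # down-weight inconsistent channels

    # Usage with one level of ResNet-18-like features (shapes are illustrative).
    rgb = torch.randn(2, 256, 28, 28)
    dep = torch.randn(2, 256, 28, 28)
    fused = ChannelAttentionFusion(channels=512)(rgb, dep)
    print(fused.shape)  # torch.Size([2, 512, 28, 28])

Gating the concatenated channels in this way lets the network down-weight whichever modality's features are unreliable at a given level, which matches the stated goal of suppressing the inconsistency between the RGB and depth streams before fusion.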