
Attention-based fusion network for human eye-fixation prediction in 3D images
Author(s) - Yiliang Lv, Wujie Zhou, Jingsheng Lei, Ye Lv, Ting Luo
Publication year - 2019
Publication title - Optics Express
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.394
H-Index - 271
ISSN - 1094-4087
DOI - 10.1364/oe.27.034056
Subject(s) - computer science , artificial intelligence , rgb color model , pattern recognition (psychology) , segmentation , margin (machine learning) , convolutional neural network , focus (optics) , feature extraction , computer vision , feature (linguistics) , machine learning , optics , linguistics , philosophy , physics
Human eye-fixation prediction in 3D images is important for many 3D applications, such as fine-grained 3D video object segmentation and intelligent bulletproof curtains. The vast majority of existing approaches are 2D-based and cannot be applied directly; the main challenge lies in the inconsistency, or even conflict, between the RGB and depth saliency maps. In this paper, we propose a three-stream architecture that accurately predicts human visual attention on 3D images end to end. First, a two-stream feature-extraction network built on convolutional neural networks is trained for the RGB and depth modalities, and hierarchical features are extracted from each ResNet-18 backbone. These multi-level features are then fed into a channel attention mechanism that suppresses inconsistency across the feature space and makes the network focus on salient targets. The enhanced features are fused step by step by a VGG-16 network to generate the coarse saliency map. Finally, each coarse map is refined through refinement blocks, which correct the network's own prediction errors based on the knowledge it has acquired, thereby converting the predicted saliency map from coarse to fine. Comparisons with six state-of-the-art approaches on the NUS dataset (CC of 0.5579, KLDiv of 1.0903, AUC of 0.8339, and NSS of 2.3373) and the NCTU dataset (CC of 0.8614, KLDiv of 0.2681, AUC of 0.9143, and NSS of 2.3795) indicate that the proposed model consistently outperforms them by a considerable margin, owing to its full use of the channel attention mechanism.
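To make the fusion step concrete, the following is a minimal sketch (not the authors' released code) of a squeeze-and-excitation style channel attention block in PyTorch, written under the assumption that the mechanism reweights concatenated RGB and depth features at one level of the hierarchy; the class name, channel counts, and reduction ratio are illustrative assumptions rather than values taken from the paper.

    # Minimal sketch of an SE-style channel attention fusion block (assumed design).
    import torch
    import torch.nn as nn

    class ChannelAttentionFusion(nn.Module):
        """Reweight channels of concatenated RGB/depth features (hypothetical layer sizes)."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial average
            self.fc = nn.Sequential(                        # excitation: per-channel weights
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
            x = torch.cat([rgb_feat, depth_feat], dim=1)    # (B, 2C, H, W)
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                    # down-weight inconsistent channels

    # Usage with one level of ResNet-18-like features (shapes are illustrative).
    rgb = torch.randn(2, 256, 28, 28)
    dep = torch.randn(2, 256, 28, 28)
    fused = ChannelAttentionFusion(channels=512)(rgb, dep)
    print(fused.shape)  # torch.Size([2, 512, 28, 28])

Gating the concatenated channels in this way lets the network down-weight whichever modality's features are unreliable at a given level, which matches the stated goal of suppressing the inconsistency between the RGB and depth streams before fusion.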