Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition

Open Access

Author(s) - Xiaodong Liu, Songyang Li, Miao Wang

Publication year - 2021

Publication title - Computational Intelligence and Neuroscience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.605

H-Index - 52

eISSN - 1687-5273

pISSN - 1687-5265

DOI - 10.1155/2021/5585041

Subject(s) - computer science, artificial intelligence, feature (linguistics), context (archaeology), emotion recognition, feature extraction, pattern recognition (psychology), modal, representation (politics), facial expression, philosophy, linguistics, chemistry, politics, political science, polymer chemistry, law, biology, paleontology

Context, such as scenes and objects, plays an important role in video emotion recognition, and incorporating it can further improve recognition accuracy. Although previous research has considered context information, it has often ignored the fact that different images may carry different emotional cues. To address these differences in emotional content across modalities and across images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition. The network consists of a multimodal feature extraction module and a multimodal feature fusion module. The feature extraction module has three subnetworks that extract features from facial, scene, and global images. Each subnetwork has two branches: the first extracts the features of its modality, and the second generates an emotion score for each image. The features and emotion scores of all images in a modality are aggregated to produce the emotion feature of that modality. The fusion module takes the multimodal features as input and generates an emotion score for each modality. Finally, the features and emotion scores of the modalities are aggregated to produce the final emotion representation of the video. Experimental results show that the proposed method is effective on the emotion recognition dataset.
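
The two-level aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how scores are computed or normalized, so everything below is an assumption made for illustration: scores come from a hypothetical linear layer (`linear_score`), they are normalized with a softmax, and aggregation is a weighted sum. The names `attention_pool`, `fuse_video`, `image_w`, and `modality_w` are likewise invented here and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features: Sequence[Vector], scores: Sequence[float]) -> Vector:
    """Weighted sum of feature vectors, weighted by softmax(scores).

    Assumption: the paper's aggregation step is modeled as softmax attention.
    """
    if not features or len(features) != len(scores):
        raise ValueError("need one score per feature vector")
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def linear_score(feature: Vector, weights: Vector) -> float:
    """Hypothetical scoring branch: a single dot product."""
    return sum(a * b for a, b in zip(feature, weights))


def fuse_video(
    modalities: Dict[str, List[Vector]],
    image_w: Dict[str, Vector],
    modality_w: Vector,
) -> Vector:
    """Two-level (hierarchical) attention pooling.

    Level 1: pool per-image features inside each modality.
    Level 2: pool the resulting modality features into one video vector.
    """
    modality_feats = []
    for name, feats in modalities.items():
        image_scores = [linear_score(f, image_w[name]) for f in feats]
        modality_feats.append(attention_pool(feats, image_scores))
    modality_scores = [linear_score(f, modality_w) for f in modality_feats]
    return attention_pool(modality_feats, modality_scores)


# Toy input: 2-dimensional features for the three modalities named in the abstract.
video = {
    "face": [[1.0, 0.0], [3.0, 0.0]],
    "scene": [[0.0, 2.0]],
    "global": [[0.0, 0.0], [0.0, 4.0]],
}
# All-zero scoring weights give uniform attention, so pooling reduces to a mean.
zero_w = {name: [0.0, 0.0] for name in video}
representation = fuse_video(video, zero_w, [0.0, 0.0])
```

In a trained network the scoring weights would be learned, so images and modalities with stronger emotional cues would receive larger weights; the all-zero weights above are only a convenient case in which the result reduces to a plain average.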
