Reasoning with Heterogeneous Graph Alignment for Video Question Answering
Author(s) -
Jiang Pin,
Yahong Han
Publication year - 2020
Publication title -
proceedings of the aaai conference on artificial intelligence
Language(s) - English
Resource type - Journals
eISSN - 2374-3468
pISSN - 2159-5399
DOI - 10.1609/aaai.v34i07.6767
Subject(s) - computer science , modality (human–computer interaction) , artificial intelligence , benchmark (surveying) , graph , question answering , heterogeneous network , modalities , representation (politics) , network architecture , theoretical computer science , computer network , telecommunications , social science , wireless network , geodesy , sociology , politics , political science , law , wireless , geography
The dominant video question answering methods are based on fine-grained representation or model-specific attention mechanism. They usually process video and question separately, then feed the representations of different modalities into following late fusion networks. Although these methods use information of one modality to boost the other, they neglect to integrate correlations of both inter- and intra-modality in an uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture from four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct extensive ablation study to the effectiveness of the network architecture. Experiments show the network to be superior in quality.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom