A Fine-Grained Spatial-Temporal Attention Model for Video Captioning

An-An Liu, Yurui Qiu, Yongkang Wong, Yu-Ting Su, Mohan Kankanhalli

Open Access

Publication year - 2018

Publication title - IEEE Access

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.587

H-Index - 127

ISSN - 2169-3536

DOI - 10.1109/access.2018.2879642

Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation

The attention mechanism has been used extensively in video captioning, where it enables deeper visual understanding. However, most existing video captioning methods apply attention at the frame level, which models only the temporal structure and the generated words and ignores the region-level spatial information that provides accurate visual features corresponding to the semantic content. In this paper, we propose a fine-grained spatial-temporal attention model (FSTA) whose main concern is the spatial information of the objects appearing in the video. In the proposed FSTA, we achieve spatial hard attention at the fine-grained region level of objects through a mask pooling module, and we compute temporal soft attention with a two-layer LSTM network with an attention mechanism to generate sentences. We test the proposed model on two benchmark datasets, MSVD and MSR-VTT. The results indicate that the proposed FSTA model achieves competitive performance against the state of the art on both datasets.
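The two attention steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the function names (`mask_pool`, `temporal_soft_attention`), the use of averaging inside the mask, and the dot-product scoring are all assumptions made for illustration. In the sketch, a binary object mask selects which spatial positions contribute to a region feature (a hard spatial selection), and a softmax over per-frame scores yields a weighted sum of frame features (a soft temporal selection). In a full model the query would presumably come from the LSTM decoder state.

```python
import math


def mask_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is 1.

    feature_map: H x W grid of C-dimensional vectors (nested lists).
    mask: H x W grid of 0/1 values marking one object's region.
    Averaging is an assumption; the abstract only names "mask pooling".
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def temporal_soft_attention(frame_features, query):
    """Weight per-frame features by their similarity to a query vector.

    Dot-product scoring is an assumption chosen for brevity.
    Returns (weights, context), where context is the weighted sum.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Toy input: a 2 x 2 feature map with 2 channels and a diagonal object mask.
feature_map = [[[1.0, 0.0], [3.0, 2.0]],
               [[5.0, 4.0], [7.0, 6.0]]]
mask = [[1, 0],
        [0, 1]]
region_feature = mask_pool(feature_map, mask)

# Two frames scored against a zero query receive equal attention weights.
weights, context = temporal_soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With the toy input, only the two masked positions contribute to `region_feature`, and the zero query gives every frame the same weight, so `context` is the plain mean of the frame features.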
