A Convolutional Temporal Encoder for Video Caption Generation
Author(s) - Qingle Huang, Zicheng Liao
Publication year - 2017
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5244/c.31.126
Subject(s) - computer science , encoder , convolutional code , artificial intelligence , decoding methods , telecommunications , operating system
We propose a convolutional temporal encoding network for video sequence embedding and caption generation. Mainstream video captioning work is based on recurrent encoders of various forms (e.g. LSTMs and hierarchical encoders). In this work, we instead propose a multi-layer convolutional neural network encoder. At the core of this encoder is a gated linear unit (GLU), which applies a linear convolutional transformation to the input followed by a nonlinear gating, and has demonstrated strong performance in natural language modeling. Our model builds on this unit for video encoding and integrates several recent techniques, including batch normalization, skip connections, and soft attention. Experiments on two large-scale benchmark datasets (MSVD and M-VAD) produce strong results and demonstrate the effectiveness of our model.
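As an illustration of the gated linear unit the abstract describes (a linear convolution over the temporal axis modulated by a sigmoid gate, combined with batch normalization and a skip connection), here is a minimal sketch in PyTorch; the class name, layer choices, and tensor shapes are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of a GLU-based temporal encoding block (not the authors' code).
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    """Gated linear unit over the time axis: out = (X*W + b) * sigmoid(X*V + c)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # A single convolution produces both the linear path and the gate (2x channels).
        self.conv = nn.Conv1d(channels, 2 * channels,
                              kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):
        # x: (batch, channels, time) -- e.g. per-frame CNN features stacked over time.
        a, b = self.conv(x).chunk(2, dim=1)   # linear transform and gate
        out = a * torch.sigmoid(b)            # gated linear unit
        return self.norm(out) + x             # batch normalization + skip connection

# Usage: encode a sequence of 512-dim frame features of length 30.
features = torch.randn(8, 512, 30)
encoded = GLUBlock(512)(features)            # shape preserved: (8, 512, 30)
```

Several such blocks can be stacked to form the multi-layer convolutional encoder, with soft attention applied by the caption decoder over the resulting temporal features.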