
DVC‐Net: A deep neural network model for dense video captioning
Author(s) - Lee Sujin, Kim Incheol
Publication year - 2021
Publication title - IET Computer Vision
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.38
H-Index - 37
eISSN - 1751-9640
pISSN - 1751-9632
DOI - 10.1049/cvi2.12013
Subject(s) - computer science , closed captioning , semantics (computer science) , context , benchmark , recurrent neural network , artificial intelligence , artificial neural network , convolutional neural network , language model , natural language processing , speech recognition , image (mathematics)
Dense video captioning (DVC) detects multiple events in an input video and generates a natural language sentence describing each event. Previous studies predominantly used convolutional neural networks to extract visual features from videos, but they did not employ high-level semantics that effectively explain video content such as people, objects, actions, and places, and they used only limited context information when generating natural language. To overcome these deficiencies, DVC-Net is proposed, a new deep neural network model that represents important events efficiently using high-level semantics as well as visual features. In addition, DVC-Net uses a bidirectional long short-term memory network, a type of recurrent neural network, to detect events over time. Furthermore, DVC-Net applies an attention mechanism and context gating to exploit context information effectively in the caption generation step. In experiments against state-of-the-art models, DVC-Net achieved gains of 1.72 in BLEU@1 (from 12.22 to 13.94) on ActivityNet Captions and 3.19 in CIDEr (from 12.61 to 15.80) on MSR-VTT, two large-scale benchmark datasets.
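The caption generation step described above combines soft attention over event features with context gating. The following is a minimal sketch, in PyTorch, of how such a combination could be wired together; it is not the authors' implementation, and the names (ContextGate, soft_attention), dimensions, and overall structure are illustrative assumptions only.

# Illustrative sketch (assumed, not the DVC-Net release): soft attention over
# per-event visual/semantic features plus a sigmoid context gate that decides
# how much of the attended context reaches the word-prediction step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGate(nn.Module):
    """Element-wise gate over the attended context, conditioned on the decoder state."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(feat_dim + hidden_dim, feat_dim)

    def forward(self, context, decoder_state):
        g = torch.sigmoid(self.gate(torch.cat([context, decoder_state], dim=-1)))
        return g * context  # gated context passed on to the caption decoder

def soft_attention(decoder_state, event_feats, proj):
    """Score each event feature against the decoder state and return their weighted sum."""
    # event_feats: (T, feat_dim), decoder_state: (hidden_dim,)
    scores = proj(event_feats) @ decoder_state   # (T,) attention logits
    weights = F.softmax(scores, dim=0)           # attention distribution over events
    return weights @ event_feats                 # (feat_dim,) attended context

# Toy usage with made-up sizes
T, feat_dim, hidden_dim = 8, 512, 256
event_feats = torch.randn(T, feat_dim)
decoder_state = torch.randn(hidden_dim)
proj = nn.Linear(feat_dim, hidden_dim, bias=False)

context = soft_attention(decoder_state, event_feats, proj)
gated = ContextGate(feat_dim, hidden_dim)(context, decoder_state)
print(gated.shape)  # torch.Size([512])

In a full model, event_feats would come from the visual and high-level semantic features pooled within each event proposal (e.g. from the bidirectional LSTM stage), and the gated context would be concatenated with the previous word embedding before the next decoding step; those details are assumptions here rather than claims about DVC-Net.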