Open Access
Multimodal object description network for dense captioning
Author(s) - Wang Weixuan, Hu Haifeng
Publication year - 2017
Publication title - Electronics Letters
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.375
H-Index - 146
eISSN - 1350-911X
pISSN - 0013-5194
DOI - 10.1049/el.2017.0326
Subject(s) - computer science , closed captioning , object (grammar) , artificial intelligence , semantics (computer science) , sentence , convolution (computer science) , word (group theory) , convolutional neural network , natural language processing , artificial neural network , computer vision , pattern recognition (psychology) , image (mathematics) , mathematics , programming language , geometry
A new multimodal object description network (MODN) model for dense captioning is proposed. The proposed model consists of a vision module and a language module. In the vision module, a modified faster region-based convolutional neural network (Faster R-CNN) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. Unlike existing methods, the proposed MODN framework adopts a multimodal layer that can effectively extract discriminant information from both object and semantic features. Moreover, MODN can generate object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the VOC2007 and Visual Genome datasets.
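To make the described architecture more concrete, below is a minimal, hypothetical PyTorch sketch of a language module with a multimodal fusion layer of the kind the abstract mentions: per-region object features (as would come from a Faster R-CNN vision module) are fused with word-level semantic features, and the fused representation is mapped to a probability distribution over the vocabulary. All dimensions, layer choices, and names here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MultimodalDescriptionHead(nn.Module):
    """Hypothetical sketch of a multimodal language module for dense captioning.

    Fuses a pooled object feature (from a detector such as Faster R-CNN) with
    word-level semantic features from an LSTM, then predicts per-step word
    logits. Dimensions and fusion choice (additive + tanh) are assumptions.
    """

    def __init__(self, vocab_size, object_dim=4096, embed_dim=512,
                 hidden_dim=512, multimodal_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # semantic (word) features
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.obj_proj = nn.Linear(object_dim, multimodal_dim)  # object-feature branch
        self.txt_proj = nn.Linear(hidden_dim, multimodal_dim)  # language branch
        self.classifier = nn.Linear(multimodal_dim, vocab_size)

    def forward(self, object_feats, word_ids):
        # object_feats: (batch, object_dim) pooled feature of one detected region
        # word_ids:     (batch, seq_len)    previous / ground-truth caption words
        sem, _ = self.rnn(self.embed(word_ids))                # (batch, seq_len, hidden_dim)
        obj = self.obj_proj(object_feats).unsqueeze(1)         # (batch, 1, multimodal_dim)
        fused = torch.tanh(obj + self.txt_proj(sem))           # multimodal fusion layer
        return self.classifier(fused)                          # (batch, seq_len, vocab_size)


if __name__ == "__main__":
    head = MultimodalDescriptionHead(vocab_size=10000)
    region = torch.randn(2, 4096)             # stand-in for R-CNN region features
    words = torch.randint(0, 10000, (2, 7))   # stand-in for caption word indices
    logits = head(region, words)
    print(logits.shape)                       # torch.Size([2, 7, 10000])
```

In this sketch the fusion is a simple projected sum followed by a tanh nonlinearity; the actual MODN multimodal layer may combine the two feature streams differently.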
