
TVPRNN for image caption generation
Author(s) - Yang Liang, Hu Haifeng
Publication year - 2017
Publication title - Electronics Letters
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.375
H-Index - 146
eISSN - 1350-911X
pISSN - 0013-5194
DOI - 10.1049/el.2017.2351
Subject(s) - computer science, image (mathematics), computer vision, remote sensing, computer graphics (images), artificial intelligence, geology
Image caption generation has attracted considerable interest in computer vision and natural language processing. However, existing methods usually use a convolutional neural network (CNN) to extract image features and a recurrent neural network (RNN) to predict the next word, so the extracted features may not adapt to the word being generated at the current time step. A new time‐varying parallel RNN (TVPRNN) is proposed to address this problem. TVPRNN uses two classical CNNs (i.e. VGGNet and Inception v3) to extract global image features and, jointly with an RNN, derives a time‐varying feature at each time step to represent the current word. Visual and textual representations are fused in a multimodal space. Moreover, a visual attention mechanism is introduced to guide the proposed network. The approach is evaluated on the benchmark Flickr8k, Flickr30k and MSCOCO datasets. The experimental results show that TVPRNN performs better than or on par with state‐of‐the‐art methods.
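The abstract describes the architecture only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It shows, in PyTorch, a decoder that re-derives a time-varying visual feature from a fixed global CNN feature at each step via the RNN hidden state, applies soft attention over spatial CNN features, and fuses the visual and textual representations in a shared multimodal space before predicting the next word. All layer names, sizes, and the hidden-state gating mechanism are illustrative assumptions.

```python
# Hedged sketch (not the paper's code) of a time-varying captioning decoder:
# the fixed global CNN feature is re-projected at every time step using the
# RNN hidden state, soft attention is computed over spatial CNN features,
# and visual and textual representations are fused before word prediction.
import torch
import torch.nn as nn

class TimeVaryingCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 global_dim=2048, spatial_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        # Gates the fixed global feature with the hidden state to produce
        # a time-varying visual feature (assumed mechanism).
        self.visual_gate = nn.Linear(hidden_dim, global_dim)
        self.visual_proj = nn.Linear(global_dim, hidden_dim)
        # Standard additive (Bahdanau-style) attention over spatial regions.
        self.att_region = nn.Linear(spatial_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.ctx_proj = nn.Linear(spatial_dim, hidden_dim)
        # Multimodal fusion of visual context and textual hidden state.
        self.fuse = nn.Linear(hidden_dim * 2, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat, spatial_feats, captions):
        # global_feat: (B, global_dim), e.g. a pooled VGGNet/Inception v3
        # feature; spatial_feats: (B, R, spatial_dim); captions: (B, T) ids.
        B, T = captions.shape
        h = global_feat.new_zeros(B, self.lstm.hidden_size)
        c = global_feat.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # Time-varying visual feature: gate the global feature by h.
            gated = global_feat * torch.sigmoid(self.visual_gate(h))
            v_t = torch.tanh(self.visual_proj(gated))           # (B, H)
            # Attention weights over the R spatial regions.
            e = self.att_score(torch.tanh(
                self.att_region(spatial_feats) +
                self.att_hidden(h).unsqueeze(1))).squeeze(-1)   # (B, R)
            alpha = torch.softmax(e, dim=-1)
            ctx = torch.tanh(self.ctx_proj(
                (alpha.unsqueeze(-1) * spatial_feats).sum(1)))  # (B, H)
            # Textual input for this step, paired with the visual feature.
            w_t = self.embed(captions[:, t])                    # (B, E)
            h, c = self.lstm(torch.cat([w_t, v_t], dim=-1), (h, c))
            # Fuse attended visual context with the textual hidden state.
            m = torch.tanh(self.fuse(torch.cat([h, ctx], dim=-1)))
            logits.append(self.classifier(m))
        return torch.stack(logits, dim=1)                       # (B, T, V)
```

Under this reading, a dummy forward pass such as TimeVaryingCaptioner(1000)(torch.randn(2, 2048), torch.randn(2, 49, 2048), torch.randint(0, 1000, (2, 12))) yields per-step vocabulary logits; the per-step re-projection is what distinguishes this design from feeding a single static CNN feature to the RNN once.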