Transformer with sparse self‐attention mechanism for image captioning
Author(s) - Wang Duofeng, Hu Haifeng, Chen Dihu
Publication year - 2020
Publication title - Electronics Letters
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.375
H-Index - 146
ISSN - 1350-911X
DOI - 10.1049/el.2020.0635
Subject(s) - transformer , computer science , encoder , artificial intelligence , computer vision , engineering , electrical engineering , voltage , operating system
Recently, the transformer has been applied to image captioning models, in which a convolutional neural network together with the transformer encoder serves as the image encoder, and the transformer decoder generates the caption. However, because its self-attention mechanism is dense, the transformer may suffer interference from non-critical objects in a scene and struggle to fully capture image information. To address this issue, the authors of this Letter propose a novel transformer model with decreasing attention gates and an attention fusion module. Specifically, they first use attention gates to make the transformer overcome the interference of non-critical objects and capture object information more efficiently, by truncating all attention weights smaller than the gate threshold. Second, by inheriting the attention matrix from the preceding layer, the attention fusion module enables each network layer to attend to other objects without losing the most critical ones. The method is evaluated on the benchmark Microsoft COCO dataset and achieves better performance than state-of-the-art methods.
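
As a rough illustration of the two mechanisms the abstract describes, below is a minimal PyTorch sketch of thresholded (gated) scaled dot-product attention combined with previous-layer attention fusion. The function name gated_attention, the gate value 0.05, and the fusion weight alpha are illustrative assumptions, not details taken from the Letter; the "decreasing" gates can be approximated by passing a smaller gate threshold at each successive layer.

import torch
import torch.nn.functional as F

def gated_attention(q, k, v, gate=0.05, prev_attn=None, alpha=0.5):
    """Sketch of sparse self-attention: weights below `gate` are zeroed
    (attention-gate step), and the result is optionally blended with the
    previous layer's attention matrix (attention-fusion step).
    The threshold and blend weight are illustrative, not from the paper."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    attn = F.softmax(scores, dim=-1)

    # Attention gate: truncate weights smaller than the gate threshold,
    # then renormalise so each row still sums to one.
    attn = torch.where(attn >= gate, attn, torch.zeros_like(attn))
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)

    # Attention fusion: inherit the previous layer's attention matrix so
    # relevant but less salient objects are not discarded entirely.
    if prev_attn is not None:
        attn = alpha * attn + (1 - alpha) * prev_attn

    return attn @ v, attn

# Example usage across two layers (shapes: batch, objects, feature dim):
q = k = v = torch.randn(2, 8, 64)
out, attn = gated_attention(q, k, v, gate=0.05)
out, attn = gated_attention(out, out, out, gate=0.02, prev_attn=attn)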
