Open Access
Video captioning based on vision transformer and reinforcement learning
Author(s) -
Hong Zhao,
Zhiwen Chen,
Lan Guo,
Zeyu Han
Publication year - 2022
Publication title -
PeerJ Computer Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.806
H-Index - 24
ISSN - 2376-5992
DOI - 10.7717/peerj-cs.916
Subject(s) - closed captioning , computer science , reinforcement learning , transformer , artificial intelligence , decoding methods , encode , speech recognition , machine learning , image (mathematics)
Global encoding of visual features is important for improving the accuracy of video captioning. In this paper, we propose a video captioning method that combines the Vision Transformer (ViT) and reinforcement learning. Firstly, ResNet-152 and ResNeXt-101 are used to extract features from videos. Secondly, the encoding block of the ViT network is applied to encode the video features. Thirdly, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the generated description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves by 2.9%, 1.4%, 0.9%, and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L, and CIDEr-D, respectively.
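To make the described pipeline concrete, the sketch below shows one way the stages could fit together in PyTorch: pre-extracted ResNet-152/ResNeXt-101 frame features are projected and globally encoded by a Transformer encoder block (standing in for the ViT encoding block), and an LSTM decodes the caption word by word. All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the abstract's pipeline; shapes and sizes are assumptions.
import torch
import torch.nn as nn

class ViTEncoderLSTMCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, d_model=512, n_heads=8,
                 n_layers=4, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project concatenated ResNet-152 + ResNeXt-101 frame features
        # (assumed 2048 + 2048 = 4096-d) to the model width.
        self.proj = nn.Linear(feat_dim, d_model)
        # Transformer encoder block, standing in for the ViT encoding block
        # that globally encodes the frame features.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # LSTM decoder that generates the caption one word per step.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len)
        enc = self.encoder(self.proj(frame_feats))    # (B, T, d_model)
        ctx = enc.mean(dim=1, keepdim=True)           # global video context
        emb = self.embed(captions)                    # (B, L, d_model)
        ctx = ctx.expand(-1, emb.size(1), -1)         # repeat context per step
        hidden, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(hidden)                       # word logits per step
```

The reinforcement-learning fine-tuning stage would then treat a caption-level evaluation score such as CIDEr-D as the reward for sampled captions and update the decoder with a policy-gradient objective; the exact reward and baseline used are specified in the paper itself, not in this sketch.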
