Open Access
Mutual information guided 3D ResNet for self‐supervised video representation learning
Author(s) - Xue Fei, Ji Hongbing, Zhang Wenbo
Publication year - 2020
Publication title - IET Image Processing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.401
H-Index - 45
eISSN - 1751-9667
pISSN - 1751-9659
DOI - 10.1049/iet-ipr.2020.0019
Subject(s) - computer science , mutual information , artificial intelligence , feature learning , pattern recognition , computer vision , machine learning
In this work, the authors propose a novel self‐supervised learning method based on mutual information to learn representations from videos without manual annotation. Different video clips sampled from the same video usually exhibit temporal coherence. To guide the network to learn such coherence, they maximise the mutual information between global features extracted from different clips of the same video (Global‐MI). However, maximising the Global‐MI alone drives the network to seek content shared across clips and may cause it to degenerate to focusing on the video background. Considering the structure of the video, they therefore also maximise the average mutual information between the global feature and local patches from multiple regions of the video clip (multi‐region Local‐MI). Their approach, called Max‐GL, learns temporal coherence by jointly maximising the Global‐MI and the multi‐region Local‐MI. Experimental results show that Max‐GL serves as an effective pre‐training method for action recognition in videos. Additional experiments on action similarity labelling and dynamic scene recognition further validate the generalisation of the representations learned by Max‐GL.
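The abstract describes a joint objective: mutual information between global features of two clips from the same video, plus the average mutual information between one clip's global feature and local patches from several regions. The sketch below illustrates that combination with an InfoNCE-style lower bound as the MI estimator; the estimator choice, function names (`infonce_mi_lower_bound`, `max_gl_objective`), weights `alpha`/`beta`, and the NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def infonce_mi_lower_bound(x, y, temperature=0.5):
    """InfoNCE-style lower bound on mutual information between paired
    feature batches x and y of shape (N, D). Positive pairs are
    (x[i], y[i]); all other pairings in the batch act as negatives.
    (Illustrative estimator, not necessarily the one used in the paper.)"""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / temperature               # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Bound: mean log-probability of positives plus log(batch size)
    return float(np.mean(np.diag(log_probs)) + np.log(len(x)))

def max_gl_objective(g1, g2, local_patches, alpha=1.0, beta=1.0):
    """Toy Max-GL-style objective: Global-MI between global features of
    two clips from the same video (g1, g2: (N, D)), plus the average
    Local-MI between g1 and each of R local patch regions
    (local_patches: (R, N, D))."""
    global_mi = infonce_mi_lower_bound(g1, g2)
    local_mi = np.mean([infonce_mi_lower_bound(g1, p) for p in local_patches])
    return alpha * global_mi + beta * local_mi
```

A network trained to maximise this objective is rewarded both for agreement between whole-clip features (temporal coherence) and for agreement between the global feature and individual spatial regions, which counteracts collapse onto shared background content.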
