Open Access
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
Author(s)
Hongxiang Li,
Meng Cao,
Xuxin Cheng,
Yaowei Li,
Zhihong Zhu,
Yuexian Zou
Publication year: 2024
Abstract
Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of the features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Video grounding suffers from two troublesome issues: (1) some visual entities co-exist in both the ground-truth moment and other moments, i.e., semantic overlapping; and (2) only a few moments in a video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learns inconsistent video representations. Both characteristics make vanilla contrastive learning unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework built on geodesic distance and game theory. We quantify the correlations among moments using geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose a semantic Shapley interaction based on geodesic-distance sampling to learn fine-grained semantic alignment among similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
Language(s): English
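This entry contains only the abstract, so the authors' implementation is not available here. As a rough illustration of the central idea, that correlations between moments can be quantified by geodesic distance on the hypersphere and used to temper contrastive learning, here is a minimal, hypothetical PyTorch sketch. Everything in it (the function names geodesic_distance and geodesic_weighted_infonce, the InfoNCE form, the weighting w = d / pi, and the temperature tau) is an assumption made for illustration, not G2L's actual loss.

    import math

    import torch
    import torch.nn.functional as F


    def geodesic_distance(x, y):
        # Geodesic (arc) distance between L2-normalized features on the
        # unit hypersphere: d(x, y) = arccos(<x, y>).
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        cos = (x * y).sum(dim=-1).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
        return torch.acos(cos)


    def geodesic_weighted_infonce(query, moments, pos_idx, tau=0.07):
        # InfoNCE-style loss in which each negative moment is down-weighted
        # by its geodesic proximity to the annotated positive moment, so
        # semantically overlapping moments are pushed away less aggressively
        # than unrelated ones.
        #   query:   (d,)   text/query embedding
        #   moments: (n, d) candidate moment embeddings
        #   pos_idx: index of the annotated (ground-truth) moment
        query = F.normalize(query, dim=-1)
        moments = F.normalize(moments, dim=-1)

        logits = moments @ query / tau                    # (n,) similarities
        d = geodesic_distance(moments, moments[pos_idx])  # (n,) distances to positive
        w = d / math.pi                                   # in [0, 1]: distant moments keep full weight
        w[pos_idx] = 1.0                                  # keep the positive term intact

        exp = w * torch.exp(logits)
        return -torch.log(exp[pos_idx] / exp.sum())


    # Toy usage with random features:
    # q = torch.randn(256); m = torch.randn(10, 256)
    # loss = geodesic_weighted_infonce(q, m, pos_idx=3)

The weighting softens the repulsion between the query and moments that lie geodesically close to the ground-truth moment, which is one plausible reading of the abstract's claim that geodesic distance guides the model toward correct cross-modal representations; the paper's actual loss, and its semantic Shapley interaction term, may differ substantially.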
