
Open Access
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Author(s)
Longtian Qiu,
Shan Ning,
Xuming He
Publication year: 2024
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image-Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space which leads to two findings. Firstly, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image and text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by these findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching the text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAv2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
Language(s): English
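
Below is a minimal, hypothetical sketch of the noise-injection idea described in the abstract: during text-only training, zero-mean Gaussian noise is added to normalized CLIP text features to approximate the image-text modality gap, and the features are re-normalized before being passed to a caption decoder. The function name and the noise standard deviation are illustrative assumptions, not values taken from the paper or the MacCap repository.

import torch
import torch.nn.functional as F

def inject_modality_gap_noise(text_features: torch.Tensor,
                              noise_std: float = 0.016) -> torch.Tensor:
    """Perturb CLIP text features with zero-mean Gaussian noise.

    The abstract models the gap between paired image and text features as a
    zero-mean Gaussian; adding such noise to text features during text-only
    training makes the downstream decoder more tolerant of image features at
    inference time. `noise_std` is an illustrative value, not the paper's.
    """
    noise = torch.randn_like(text_features) * noise_std
    # Re-normalize so the perturbed features stay on CLIP's unit hypersphere.
    return F.normalize(text_features + noise, dim=-1)

# Usage sketch (hypothetical variable names): encode captions with CLIP's
# text encoder, perturb them, and train the decoder on the noisy features.
# text_features = clip_model.encode_text(token_ids)      # shape (B, D)
# text_features = F.normalize(text_features, dim=-1)
# decoder_input = inject_modality_gap_noise(text_features)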
