
Cross-modality Consistency Network for Remote Sensing Text-image Retrieval
Author(s) -
Yuchen Sha,
Yujian Feng,
Miao He,
Yichi Jin,
Shuai You,
Yimu Ji,
Fei Wu,
Shangdong Liu,
Shaoshuai Che
Publication year - 2025
Publication title -
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.246
H-Index - 88
eISSN - 2151-1535
pISSN - 1939-1404
DOI - 10.1109/jstars.2025.3586914
Subject(s) - geoscience, signal processing and analysis, power, energy and industry applications
Remote Sensing Cross-modality Text-Image Retrieval (RSCTIR) aims to retrieve images matching a natural language description from a large gallery, and vice versa. Existing methods mainly capture local and global context information within each modality for cross-modality matching. However, these methods are prone to interference from redundant information, such as background noise and irrelevant words, and neglect the co-occurrence semantic relations between modalities (i.e., the probability of a piece of semantic information co-occurring with other information). To filter out intra-modality redundant information and capture inter-modality co-occurrence relations, we propose a Cross-modality Consistency Network (CCNet) comprising a Text-image Attention-conditioned Module (TAM) and a Co-occurrent Features Module (CFM). First, TAM fuses visual and textual feature representations through a cross-modality attention mechanism that focuses on semantically similar fine-grained image features and then generates aggregated visual representations. Second, CFM estimates co-occurrence probability by measuring fine-grained feature similarity, thereby reinforcing the relations of target-consistent features across modalities. In addition, we propose the Cross-modality Distinction (CD) loss to learn semantic consistency between modalities by compacting intra-class samples and separating inter-class samples. Extensive experiments on three benchmarks demonstrate that our approach outperforms state-of-the-art methods.
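To make the TAM step concrete, the following is a minimal PyTorch sketch of text-conditioned cross-modality attention. It is an illustration only, not the paper's implementation; the function name, tensor shapes, and temperature value are assumptions.

import torch
import torch.nn.functional as F

def text_guided_attention(text_feats, region_feats, temperature=0.1):
    # Hypothetical sketch, not the authors' code.
    # text_feats:   (B, Lt, D) word-level textual features
    # region_feats: (B, Lr, D) fine-grained image region features
    # Returns:      (B, Lt, D) text-conditioned aggregated visual representations
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(region_feats, dim=-1)
    # Each word attends over all image regions; semantically similar
    # regions receive high weight, background regions are suppressed.
    attn = torch.softmax(t @ v.transpose(1, 2) / temperature, dim=-1)  # (B, Lt, Lr)
    return attn @ region_feats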
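Similarly, the co-occurrence estimation in CFM can be sketched as pooling fine-grained cross-modality similarities into a distribution over regions. Again a hypothetical sketch under the same assumed shapes, not the authors' code.

import torch
import torch.nn.functional as F

def co_occurrence_probability(text_feats, region_feats):
    # Assumed illustration of measuring fine-grained similarity.
    t = F.normalize(text_feats, dim=-1)            # (B, Lt, D)
    v = F.normalize(region_feats, dim=-1)          # (B, Lr, D)
    sim = t @ v.transpose(1, 2)                    # (B, Lt, Lr) word-region similarity
    # Softmax over regions gives a co-occurrence distribution per word;
    # averaging over words yields a per-region co-occurrence probability
    # that can reweight (reinforce) target-consistent features.
    return torch.softmax(sim, dim=-1).mean(dim=1)  # (B, Lr)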
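Finally, the stated effect of the CD loss (compacting intra-class samples, separating inter-class samples) matches the shape of a margin-based bidirectional ranking loss; the sketch below shows one such formulation, assuming matched image-text pairs share the same batch row index. The margin value and exact form are assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def cross_modality_distinction_sketch(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (B, D) global embeddings; row i of each is a matched pair.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                    # (B, B) cross-modality similarities
    pos = scores.diag().unsqueeze(1)          # (B, 1) matched-pair similarities
    # Hinge terms: push every mismatched pair at least `margin` below its
    # matched pair (separation) while pulling matched pairs up (compaction).
    cost_i2t = (margin + scores - pos).clamp(min=0)      # image -> text direction
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # text -> image direction
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (cost_i2t.masked_fill(mask, 0).mean()
            + cost_t2i.masked_fill(mask, 0).mean())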