Open Access
CTNet: Multimodal Remote Sensing Image Key Point Detection and Description for CNN and Transformer Architectures
Author(s) -
Chenke Yue,
Yin Zhang,
Junhua Yan,
Yong Liu,
Pengyu Guo
Publication year - 2025
Publication title -
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.246
H-Index - 88
eISSN - 2151-1535
pISSN - 1939-1404
DOI - 10.1109/jstars.2025.3595440
Subject(s) - geoscience , signal processing and analysis , power, energy and industry applications
Keypoint detection and description from multisensor or multimodal images are fundamental to image registration and its downstream tasks. However, the nonlinear radiometric differences, illumination variations, and geometric distortions between multimodal remote sensing images pose significant challenges. To address these issues, this paper proposes a weakly supervised multimodal keypoint detection and description network (CTNet), which extracts robust and repeatable feature descriptors at a low cost without requiring densely labeled annotations or extensive pretraining. In terms of network design, CTNet effectively combines convolutional neural network (CNN) and Transformer architectures by introducing a multimodal global and local information interaction (MGLI) module. Additionally, a lightweight keypoint detector is designed to efficiently detect keypoints by evaluating pixel saliency within neighborhoods and incorporating their depth maxima. For model optimization, a novel loss function, multiple pair weighted loss, is introduced. This loss function samples and weights positive and negative pairs of multimodal features, effectively capturing the similarity relationships among samples to learn a robust feature embedding space. Finally, CTNet is evaluated on both public and self-collected multimodal VIS-SAR and VIS-IR image datasets and compared with state-of-the-art keypoint detection and description models. Experimental results demonstrate that CTNet achieves superior matching accuracy and robustness in multimodal image matching tasks, outperforming existing methods.
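The abstract's "multiple pair weighted loss" samples positive and negative descriptor pairs across modalities and weights them by difficulty. The paper's exact formulation is not given here; the sketch below illustrates the general idea with a multi-similarity-style weighting over L2-normalised descriptors, where hard positives (low similarity) and hard negatives (high similarity) receive larger effective weights. All function and parameter names (`multi_pair_weighted_loss`, `alpha`, `beta`, `margin`) are illustrative assumptions, not the authors' API.

```python
import numpy as np

def multi_pair_weighted_loss(desc_a, desc_b, alpha=2.0, beta=50.0, margin=0.5):
    """Illustrative weighted positive/negative pair loss (a sketch, not
    the paper's exact loss).

    desc_a, desc_b: (N, D) L2-normalised descriptors from two modalities;
    row i of each matrix is assumed to describe the same keypoint, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    sim = desc_a @ desc_b.T                      # (N, N) cosine similarities
    pos = np.diag(sim)                           # positive (matching) pairs

    # Soft-plus weighting: positives far below the margin and negatives
    # far above it dominate the loss, emphasising hard examples.
    pos_term = np.log1p(np.exp(-alpha * (pos - margin))) / alpha

    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)               # exclude positives
    neg_term = np.log1p(np.sum(np.exp(beta * (neg - margin)), axis=1)) / beta

    return float(np.mean(pos_term + neg_term))
```

With correctly matched descriptor rows the loss should be lower than with a shuffled (mismatched) pairing, which is the behaviour a metric-learning loss for cross-modal matching needs.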
