Open Access
Dual-Stream Multi-Modal Fusion with Local-Global Attention for Remote Sensing Object Detection
Author(s) -
Youxiang Huang,
Zhuo Wang,
Tiantian Tang,
Tomoaki Ohtsuki,
Guan Gui
Publication year - 2025
Publication title -
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 1.246
H-Index - 88
eISSN - 2151-1535
pISSN - 1939-1404
DOI - 10.1109/jstars.2025.3637891
Subject(s) - geoscience, signal processing and analysis, power, energy and industry applications
Object detection in remote sensing imagery plays a crucial role in providing precise geospatial information for urban planning and environmental monitoring. However, real-world remote sensing scenarios often involve complex conditions such as varying illumination, weather interference, and low signal-to-noise ratios, which significantly degrade the performance of traditional single-modal detection methods. To overcome these limitations, multimodal object detection has emerged, demonstrating great potential by integrating complementary information from multiple modalities. Nevertheless, existing multimodal frameworks still face challenges such as insufficient cross-modal interaction, limited learning of complementary features, and high computational costs caused by redundant fusion in complex environments. To address these challenges, we propose an enhanced multimodal fusion strategy aimed at maximizing cross-modal feature learning. Our method employs a dual-backbone architecture to extract modality-specific representations independently, integrating a direction attention (DA) module at an early stage of each backbone to enhance discriminative feature extraction. We then introduce a dual-stream feature fusion network (DSFN) to effectively fuse cross-modal features, generating rich representations for the detection head. In addition, we embed a local-global channel attention (LGCA) mechanism in the head stage to strengthen feature learning along the channel dimension before producing the final prediction. Extensive experiments on the widely used VEDAI multimodal remote sensing dataset demonstrate that our method achieves state-of-the-art performance, while evaluations on single-modal datasets confirm its exceptional generalization capability.
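The abstract names a local-global channel attention (LGCA) block but does not give its internal equations. The following PyTorch sketch is therefore only an illustrative guess at such a module, not the authors' implementation: the class name, the ECA-style "local" 1D convolution over channels, the squeeze-and-excitation-style "global" MLP, and the parameters `reduction` and `kernel_size` are all assumptions introduced here for clarity.

```python
# Hedged sketch of a local-global channel attention block (assumed design,
# not the paper's LGCA): a global SE-style descriptor and a local ECA-style
# descriptor are combined into a sigmoid gate that reweights channels.
import torch
import torch.nn as nn


class LocalGlobalChannelAttention(nn.Module):
    """Illustrative channel attention mixing local and global channel context."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 3):
        super().__init__()
        # Global branch: squeeze-and-excitation style bottleneck MLP.
        self.global_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Local branch: 1D convolution across the channel axis captures
        # neighbouring-channel interactions without a dense projection.
        self.local_conv = nn.Conv1d(
            1, 1, kernel_size=kernel_size, padding=kernel_size // 2, bias=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        pooled = x.mean(dim=(2, 3))                            # (B, C) global average pooling
        g = self.global_mlp(pooled)                            # global channel descriptor
        l = self.local_conv(pooled.unsqueeze(1)).squeeze(1)    # local channel descriptor
        gate = torch.sigmoid(0.5 * (g + l)).view(b, c, 1, 1)   # fused channel gate
        return x * gate                                        # channel-wise reweighting


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)          # e.g. fused RGB/IR feature map
    lgca = LocalGlobalChannelAttention(channels=64)
    print(lgca(feats).shape)                    # torch.Size([2, 64, 32, 32])
```

In this reading, the "local" and "global" descriptors are averaged before the sigmoid so neither branch dominates the gate; how the paper actually combines them (and where DA and DSFN sit around it) is specified in the full text.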
