
Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing
Author(s) -
Zibo Hu,
Kun Gao,
Jingyi Wang,
Zhijia Yang,
Zefeng Zhang,
Haobo Cheng,
Wei Li
Publication year - 2025
Publication title -
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Language(s) - English
Resource type - Journal article
SCImago Journal Rank - 1.246
H-Index - 88
eISSN - 2151-1535
pISSN - 1939-1404
DOI - 10.1109/jstars.2025.3575770
Subject(s) - geoscience, signal processing and analysis, power, energy and industry applications
Abstract - Open-set object detection unifies candidate-category object detection and remote sensing visual grounding, supporting both multi-object detection over candidate categories and text-guided detection of specific targets. Most existing open-set detectors extend candidate-category detectors by introducing text information, so they must process text and images jointly, which increases training overhead and computational complexity. An open-set detector consists of a backbone, a neck, and a prediction head; the neck dominates computational cost because of its multi-scale self-attention and cross-modal attention. Little research, however, has focused on improving this efficiency while maintaining model performance. This paper addresses the gap by proposing Enhanced Grounding DINO, which optimizes the neck network to reduce computational complexity while preserving performance. Its key contribution is an efficient cross-modality block composed of the Multi-Scale Visual-Cross-Text Fusion Module (MSVCTFM) and Inverse Pyramid Feature Refinement (IPFR). The block reduces the cost of both multi-scale visual feature refinement and text-visual feature fusion without degrading performance: the MSVCTFM decouples and optimizes the fusion of multi-scale visual and text features, thereby enhancing model performance, while the IPFR further cuts the cost of refining multi-scale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on the visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate-category dataset DOTA.
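
The abstract names the modules but not their internals, so the following is a minimal PyTorch sketch of one plausible reading: cross-attention from visual tokens to text tokens as a stand-in for the MSVCTFM fusion step, and self-attention applied only at the coarsest pyramid level followed by upsample-and-add propagation as a stand-in for IPFR's cost saving. All class names, shapes, and layer choices below are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only: the abstract does not specify MSVCTFM/IPFR
    # internals, so every module below is an assumption about the design.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TextVisualFusion(nn.Module):
        """Cross-attention from visual tokens (queries) to text tokens
        (keys/values); a hypothetical stand-in for the MSVCTFM fusion step."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            # visual: (B, N_v, C) tokens of one pyramid level; text: (B, N_t, C)
            fused, _ = self.attn(query=visual, key=text, value=text)
            return self.norm(visual + fused)  # residual connection + norm


    class InversePyramidRefinement(nn.Module):
        """Hypothetical reading of IPFR: run self-attention only on the
        coarsest (cheapest) level, then propagate the result to finer levels
        by upsample-and-add, avoiding self-attention over all scales."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, feats: list[torch.Tensor]) -> list[torch.Tensor]:
            # feats: pyramid levels (B, C, H_i, W_i), ordered fine -> coarse
            b, c, h, w = feats[-1].shape
            tokens = feats[-1].flatten(2).transpose(1, 2)       # (B, H*W, C)
            tokens, _ = self.self_attn(tokens, tokens, tokens)  # small H*W => cheap
            out = [tokens.transpose(1, 2).reshape(b, c, h, w)]
            for f in reversed(feats[:-1]):                      # coarse -> fine
                up = F.interpolate(out[0], size=f.shape[-2:], mode="nearest")
                out.insert(0, f + up)
            return out


    if __name__ == "__main__":
        # Toy shapes: three pyramid levels, a 12-token text prompt, C = 256.
        feats = [torch.randn(2, 256, s, s) for s in (64, 32, 16)]
        text = torch.randn(2, 12, 256)
        refined = InversePyramidRefinement(256)(feats)
        tokens = refined[0].flatten(2).transpose(1, 2)          # (2, 4096, 256)
        fused = TextVisualFusion(256)(tokens, text)
        print(fused.shape)  # torch.Size([2, 4096, 256])

Since self-attention cost grows quadratically with token count, restricting it to the coarsest level is the kind of change that could plausibly account for a large share of the reported GFLOPs savings; the paper's actual mechanism may differ.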