Could Describe Anything Model Understand Remote Sensing Objects?
Author(s) - Ziyi Gao, Shuzhou Sun, Ming-Ming Cheng, Yunli Long, Yongxiang Liu, Li Liu
Publication year - 2025
Publication title - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Language(s) - English
Resource type - Journal
SCImago Journal Rank - 1.246
H-Index - 88
eISSN - 2151-1535
pISSN - 1939-1404
DOI - 10.1109/jstars.2025.3616330
Subject(s) - Geoscience; Signal Processing and Analysis; Power, Energy and Industry Applications
Fine-grained image captioning for Remote Sensing Images (RSIs) remains an underexplored yet crucial task for multimodal scene understanding, especially in domains where localized semantic interpretation is essential. However, the heterogeneous nature of remote sensing data, spanning modalities such as visible light, infrared, and Synthetic Aperture Radar (SAR), poses significant challenges for generalization, particularly in the absence of sentence-level annotations. Existing vision-language models, primarily trained on natural image-text pairs, often fail to capture spatially grounded semantics in RSIs, leading to degraded performance when transferred to non-natural domains. In this paper, we present the first systematic evaluation of the Describe Anything Model (DAM) for localized captioning in remote sensing. To address the scarcity of aligned supervision, we construct a weakly supervised benchmark framework featuring three levels of region-of-interest prompting (full image, center point, bounding box), and harmonize test data across three modalities with a unified object category (ship). Caption quality is assessed through both linguistic diversity metrics and multimodal alignment indicators, including CLIP and RemoteCLIP scores. Experimental results reveal that DAM performs robustly on visible imagery with fine-grained prompts, but exhibits significant performance degradation in the infrared and SAR domains, where modality-specific distortions hinder effective spatial grounding. Our benchmark exposes a critical bottleneck in current foundation models for cross-modality captioning and establishes a unified testbed for evaluating and improving multimodal language understanding in the remote sensing field.
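To make the alignment scoring in the abstract concrete, the following is a minimal sketch, not the authors' released code, of how a CLIP alignment score between a generated region caption and the prompted image region could be computed with the Hugging Face transformers CLIP API. The checkpoint name, file name, bounding box, and the helper clip_alignment_score are illustrative assumptions; a RemoteCLIP checkpoint would be substituted analogously, and its loading path is omitted here.

# Minimal sketch (not the paper's implementation): CLIP cosine similarity
# between a caption and an image region, mirroring the bounding-box and
# full-image prompting levels described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_alignment_score(image_path: str, caption: str, bbox=None) -> float:
    """Cosine similarity between a caption and an image (or a bounding-box crop).

    bbox, if given, is (left, upper, right, lower) in pixels, matching the
    bounding-box prompt level; bbox=None corresponds to full-image prompting.
    """
    image = Image.open(image_path).convert("RGB")
    if bbox is not None:
        image = image.crop(bbox)  # restrict scoring to the prompted region

    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # L2-normalize and take the dot product to get cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Example: score a DAM-style caption for a ship region in a visible-light RSI.
# The file name and box coordinates are placeholders.
score = clip_alignment_score("harbor_scene.png", "a cargo ship docked at a pier", bbox=(120, 80, 260, 200))
print(f"CLIP alignment: {score:.3f}")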