EDATRAF: An Enhanced Depth-Aware Transformer for Monocular 3D Object Detection Using Feature Fusion and Cross-Query Attention
Author(s) - Daewoong Cha, Samuel Kakuba, George Albert Bitwire, Dong Seog Han
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Journal
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/ACCESS.2025.3613157
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Abstract - Monocular 3D object detection remains a critical yet challenging task in autonomous driving because of the depth ambiguity and occlusion inherent in single RGB images. While recent transformer-based methods capture global context via self-attention, they still suffer from limited spatial precision and weak depth supervision. In this paper, we propose an enhanced depth-aware transformer framework (EDATRAF) for monocular 3D object detection. EDATRAF integrates three key components: a depth-aware feature enhancement (DFE) module that fuses RGB and predicted depth features via cross-attention to reduce depth ambiguity; a depth target feature (DTF) volume with instance-aware selective refinement (ISR) that enforces object-level depth consistency through auxiliary volume supervision; and foreground-weighted query embeddings combined with cross-query attention (CQA) that enhance query discrimination and representation. To improve training, we design a distance-aware loss weighting scheme and a bird's-eye-view (BEV) loss that captures object extent and orientation in the ground plane. Additionally, EDATRAF adopts multi-scale positional encoding and supports dynamic anchor box (DAB)-style query initialization for enhanced spatial reasoning. Extensive experiments on the KITTI dataset show that EDATRAF significantly outperforms baseline models in 3D average precision (AP) and BEV intersection over union (IoU), particularly in scenes with occlusion, scale variation, and complex spatial layouts. These results highlight the robustness and effectiveness of our approach for real-world autonomous perception.
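The paper itself is not reproduced here, so no implementation details are available beyond the abstract. As a rough illustration of the RGB-depth cross-attention fusion described for the DFE module, the following minimal PyTorch sketch shows one plausible layout; the class name, dimensions, and the residual-plus-normalization structure are assumptions, not the authors' code.

import torch
import torch.nn as nn

class DepthAwareFusion(nn.Module):
    # Hypothetical sketch of the DFE cross-attention fusion:
    # RGB queries attend over predicted-depth keys/values.
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat, depth_feat: (B, H*W, C) flattened feature maps.
        fused, _ = self.cross_attn(rgb_feat, depth_feat, depth_feat)
        # Residual connection preserves the original RGB signal.
        return self.norm(rgb_feat + fused)

# Usage on dummy tensors: batch of 2, a 50x50 feature map, 256 channels.
rgb = torch.randn(2, 2500, 256)
depth = torch.randn(2, 2500, 256)
out = DepthAwareFusion()(rgb, depth)  # -> (2, 2500, 256)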
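The abstract also mentions a distance-aware loss weighting scheme without specifying its form. The sketch below assumes a simple linear ramp over ground-truth depth that upweights distant objects; the ramp shape, direction, and the d_max and alpha parameters are all hypothetical.

import torch

def distance_aware_weights(gt_depth, d_max=60.0, alpha=1.0):
    # Hypothetical linear ramp: objects farther away receive up to
    # (1 + alpha) times the loss weight of nearby objects.
    return 1.0 + alpha * torch.clamp(gt_depth / d_max, max=1.0)

# Per-object regression losses scaled by distance before averaging.
gt_depth = torch.tensor([5.0, 30.0, 55.0])      # per-object depth in meters
per_obj_loss = torch.tensor([0.8, 0.5, 0.9])    # e.g., per-object L1 errors
weighted_loss = (distance_aware_weights(gt_depth) * per_obj_loss).mean()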
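Likewise, the BEV loss is described only as capturing object extent and orientation in the ground plane. The simplified sketch below computes an axis-aligned BEV IoU over (x, z, w, l) ground-plane boxes and omits orientation, which would require rotated-box intersection; it is purely illustrative and not the paper's formulation.

import torch

def bev_iou_axis_aligned(boxes_a, boxes_b):
    # boxes: (N, 4) tensors of (x_center, z_center, width, length) in the ground plane.
    ax1, az1 = boxes_a[:, 0] - boxes_a[:, 2] / 2, boxes_a[:, 1] - boxes_a[:, 3] / 2
    ax2, az2 = boxes_a[:, 0] + boxes_a[:, 2] / 2, boxes_a[:, 1] + boxes_a[:, 3] / 2
    bx1, bz1 = boxes_b[:, 0] - boxes_b[:, 2] / 2, boxes_b[:, 1] - boxes_b[:, 3] / 2
    bx2, bz2 = boxes_b[:, 0] + boxes_b[:, 2] / 2, boxes_b[:, 1] + boxes_b[:, 3] / 2
    inter_w = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(min=0)
    inter_l = (torch.min(az2, bz2) - torch.max(az1, bz1)).clamp(min=0)
    inter = inter_w * inter_l
    union = boxes_a[:, 2] * boxes_a[:, 3] + boxes_b[:, 2] * boxes_b[:, 3] - inter
    return inter / union.clamp(min=1e-6)

# A BEV loss could then be 1 - IoU, averaged over matched prediction/GT pairs.
pred = torch.tensor([[0.0, 10.0, 1.8, 4.2]])
gt = torch.tensor([[0.2, 10.5, 1.8, 4.0]])
loss = (1.0 - bev_iou_axis_aligned(pred, gt)).mean()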