LSD-HybridViT: A Hybrid Vision Transformer with Lightweight Mixed-domain Attention and Frequency Multi-Scale Dilated Convolution Feature Fusion for Diabetic Retinopathy Grading
Author(s) -
Xiaofang Gou,
Ye Wang,
Wenman Li
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3610904
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Diabetic Retinopathy (DR) grading plays a pivotal role in clinical decision-making, where precise staging directly impacts treatment planning and blindness prevention. Ophthalmologists’ diagnostic reasoning systematically integrates local lesion assessment with global pathological pattern analysis, while also drawing on edge and frequency-domain information. Inspired by this process, we propose a hybrid framework that emulates it through three synergistic components, constructing a clinically aligned feature hierarchy: (1) A Lightweight Mixed-domain Attention (LMA)-driven biomarker localization module enhances sensitivity to microaneurysms and exudates by learning spatial-channel dependencies while maintaining a lightweight architecture. (2) Frequency Multi-scale Dilated Convolutional Blocks integrate multi-scale spatial and frequency-domain features, mimicking the human visual process in lesion identification. (3) Vision Transformers model long-range dependencies to analyze lesion distribution patterns and quantify pathological severity, analogous to clinicians’ holistic evaluation. The entire model contains only 10M parameters, effectively bridging the local modeling capacity of convolutional networks with the global reasoning ability of transformers. On the APTOS-2019 dataset, our method achieves a grading accuracy of 89.6% and a Quadratic Weighted Kappa (QWK) of 93.1%; on the DDR dataset, it achieves a grading accuracy of 85.2% and a QWK of 84.9%. These results show that the proposed method outperforms many recently popular DR diagnosis and classification approaches.
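The abstract reports Quadratic Weighted Kappa (QWK) alongside accuracy. For readers unfamiliar with the metric, the following is a minimal sketch of the standard QWK definition in plain Python, not the authors' evaluation code. The function name `quadratic_weighted_kappa`, the default of five ordinal grades (0 to 4), and the toy labels are illustrative assumptions. QWK penalizes a prediction by the squared distance between predicted and true grade, so confusing grade 0 with grade 4 costs far more than confusing grade 0 with grade 1.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Standard QWK for integer ordinal labels in [0, num_classes - 1].

    Illustrative sketch only; num_classes=5 is an assumption matching
    the usual five-grade DR scale.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    k = num_classes

    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic disagreement weight, 0 on the diagonal and 1 at the corners.
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two marginals.
            weighted_expected += w * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0.0:
        # Every label and every prediction fall in one class; treating this
        # as perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy labels, invented purely for illustration (not data from the paper).
true_grades = [0, 0, 1, 2, 2, 3, 4, 4]
pred_grades = [0, 1, 1, 2, 3, 3, 4, 2]
print(f"QWK = {quadratic_weighted_kappa(true_grades, pred_grades):.3f}")
```

The function returns a value in [-1, 1], where 1 means perfect agreement and 0 means agreement no better than chance; the abstract states its QWK figures as percentages.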