LSD-HybridViT: A Hybrid Vision Transformer with Lightweight Mixed-domain Attention and Frequency Multi-Scale Dilated Convolution Feature Fusion for Diabetic Retinopathy Grading

Xiaofang Gou; Ye Wang; Wenman Li

Open Access

Author(s) - Xiaofang Gou, Ye Wang, Wenman Li

Publication year - 2025

Publication title - IEEE Access

Language(s) - English

Resource type - Magazines

SCImago Journal Rank - 0.587

H-Index - 127

eISSN - 2169-3536

DOI - 10.1109/access.2025.3610904

Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation

Diabetic Retinopathy (DR) grading plays a pivotal role in clinical decision-making, because precise staging directly affects treatment planning and blindness prevention. Ophthalmologists integrate local lesion assessment with global pathological pattern analysis, and also draw on edge and frequency-domain information when reading an image. Inspired by this diagnostic reasoning, we propose a hybrid framework that emulates the clinical cognitive process through three synergistic components, constructing a clinically aligned feature hierarchy: (1) A Lightweight Mixed-domain Attention (LMA)-driven biomarker localization module enhances sensitivity to microaneurysms and exudates by learning spatial-channel dependencies while maintaining a lightweight architecture. (2) Frequency Multi-scale Dilated Convolutional Blocks integrate multi-scale spatial and frequency-domain features, mimicking the human visual process in lesion identification. (3) Vision Transformers model long-range dependencies to analyze lesion distribution patterns and quantify pathological severity, analogous to clinicians’ holistic evaluation. The entire model contains only 10M parameters and bridges the local modeling capacity of convolutional networks with the global reasoning ability of transformers. Experiments show that our method achieves a grading accuracy of 89.6% and a Quadratic Weighted Kappa (QWK) of 93.1% on APTOS-2019, and a grading accuracy of 85.2% and a QWK of 84.9% on the DDR dataset. These results show that the proposed method outperforms many recent DR diagnosis and classification approaches.
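
For readers unfamiliar with the QWK metric reported above, the following is a minimal sketch of how Quadratic Weighted Kappa is commonly computed for ordinal grading. It is not taken from the paper: the function name `quadratic_weighted_kappa`, the default of five grades (0 to 4), and the convention of returning 1.0 in the degenerate single-class case are all assumptions made for illustration. A reported QWK of 93.1% would correspond to a return value of about 0.931.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    num_classes: int = 5,  # assumption: five DR grades, labelled 0..4
) -> float:
    """Quadratic Weighted Kappa between two integer label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")

    n = len(y_true)

    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal histograms of true and predicted grades.
    true_hist = [sum(row) for row in observed]
    pred_hist = [
        sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)
    ]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic penalty: grows with the squared distance between grades.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            # Expected count under chance agreement (outer product of marginals).
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Only happens when every label and every prediction is the same class.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    truth = [0, 1, 2, 3, 4]
    print(quadratic_weighted_kappa(truth, truth))        # perfect agreement
    print(quadratic_weighted_kappa(truth, truth[::-1]))  # fully reversed grades
```

Because the penalty is quadratic in the grade distance, confusing grade 0 with grade 4 costs far more than confusing adjacent grades, which is why QWK is often reported alongside plain accuracy for ordinal tasks such as DR grading.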
