GCViT-EffNet: Global Context Vision Transformer and EfficientNet Fusion for Classification of Ocular Diseases in CLAHE-Enhanced Fundus Images
Author(s) - Irshad Ahmad, Kaznah Alshammari
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Journal
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/ACCESS.2025.3631556
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Accurate and early detection of ocular diseases such as cataracts, glaucoma, and diabetic retinopathy is essential for preventing vision loss and enabling timely treatment. Deep learning models have shown promising results in healthcare, yet ocular disease classification remains challenging because pathological signs are subtle in the early stages of disease. Accurate classification therefore depends on features drawn from both local and global pathological regions. In this study, we propose a novel hybrid deep learning model that integrates a Global Context Vision Transformer (GCViT) with EfficientNetB0 for multi-class classification of fundus images. EfficientNetB0 was chosen for its superior performance among the pretrained backbones evaluated: EfficientNetB0, EfficientNetB1, EfficientNetB7, ResNet50, ResNet101, MobileNetV2, DenseNet121, InceptionV3, InceptionResNetV2, NASNetMobile, and Xception. Contrast Limited Adaptive Histogram Equalization (CLAHE) in the LAB color space is applied as a preprocessing step to enhance ocular structures and improve image contrast. The hybrid model leverages EfficientNetB0 to capture fine-grained local features, while GCViT extracts global contextual representations; these feature maps are fused and passed through dense layers for multi-class classification. We evaluate the model on a balanced dataset of 4,217 fundus images spanning four classes. The model achieves an accuracy of 98.92%, outperforming standalone CNNs, transformer-based models, and state-of-the-art methods. Ablation studies confirm the effectiveness of CLAHE preprocessing and the complementary nature of the CNN-Transformer fusion. The architecture demonstrates strong potential as an automated screening tool for ocular pathologies, with implications for clinical decision support and teleophthalmology.
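
The CLAHE preprocessing described in the abstract can be reproduced with standard tooling. The sketch below, using OpenCV, applies CLAHE to the lightness (L) channel in LAB color space and converts back; the clip limit and tile grid size are illustrative defaults, not parameters reported in the abstract.

import cv2

def clahe_lab(bgr_image, clip_limit=2.0, tile_grid_size=(8, 8)):
    # Convert to LAB so equalization touches only lightness, not color.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    # clip_limit / tile_grid_size are assumed defaults, not the paper's values.
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)
    # Recombine the equalized L channel with the original chroma channels.
    enhanced = cv2.merge((l_eq, a, b))
    return cv2.cvtColor(enhanced, cv2.COLOR_LAB2BGR)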
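The fusion architecture can likewise be sketched in Keras. EfficientNetB0 ships with tf.keras.applications, but GCViT does not, so the sketch assumes a gcvit_backbone model supplied from elsewhere (for example, a GCViT implementation from the keras_cv_attention_models package with its classification head removed) that maps an image to a feature vector. The dense-layer widths are illustrative assumptions, not the paper's reported configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fusion_model(gcvit_backbone, num_classes=4, input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)

    # Local branch: pretrained EfficientNetB0 for fine-grained local features.
    cnn = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_tensor=inputs)
    local_feats = layers.GlobalAveragePooling2D()(cnn.output)

    # Global branch: GCViT contextual representation of the same image
    # (gcvit_backbone is an assumed, externally supplied feature extractor).
    global_feats = gcvit_backbone(inputs)

    # Fuse the two feature vectors and classify with dense layers.
    fused = layers.Concatenate()([local_feats, global_feats])
    x = layers.Dense(256, activation="relu")(fused)  # illustrative width
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs, name="gcvit_effnet_fusion")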