Open Access
Exploring Vision Transformers and Explainable AI for Enhanced Artefact Classification in Esophageal Endoscopic Images
Author(s) -
Preeti Bissoonauth-Daiboo,
Muhammad Muzzammil Auzine,
Muhammad Inshal,
Fatima Shannaq,
Tanzila Saba,
Xiaohong Gao,
Maleika Heenaye-Mamode Khan
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3616796
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Esophageal cancer (EC) remains among the cancers with the highest incidence and mortality rates in global cancer statistics, emphasising the imperative to enhance diagnostic precision and reliability through the use of advancing technologies. While AI-enhanced systems can considerably improve the early detection of EC, the prevalence of artefacts (roughly one in four frames) during endoscopy procedures significantly compromises such systems, leading to unreliable medical decision-making. Transformer networks, initially designed for natural language processing tasks, have been adapted to images as Vision Transformers (ViT) and have demonstrated outstanding performance on medical images, offering distinctive features advantageous for image processing. The application of ViT to detecting and classifying artefacts in endoscopic images, particularly colour misalignment artefacts, is still subject to continual refinement and enhancement. This work investigates the use of ViT for the classification of colour misalignment artefacts in esophageal endoscopy images. Moreover, even though ViT has been a major breakthrough, its acceptance in real-world applications is often jeopardised by the lack of interpretability of how the classification results have been reached. Consequently, Explainable Artificial Intelligence (XAI) techniques have been explored to expose the criteria used to achieve the outcome. Several variants of the ViT and Data-efficient image Transformer (DeiT) networks have been fine-tuned and applied to our dataset in order to improve and evaluate their performance on colour misalignment classification in esophageal endoscopic images. Furthermore, XAI methods have been implemented to reveal the criteria used by the network in reaching the classification results.
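The "distinctive features" a ViT brings to image processing start with its input representation: the frame is cut into fixed-size patches that become a token sequence, analogous to words in a sentence. The sketch below illustrates that tokenization step in plain NumPy; the 224x224 frame size and 16-pixel patch size are standard ViT defaults used here for illustration, not configuration details reported by the paper.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an HxWxC image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size*patch_size*C):
    the token sequence a ViT encoder consumes, before the learned
    linear projection and position embeddings are applied.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    patches = (
        image.reshape(rows, patch_size, cols, patch_size, c)
             .transpose(0, 2, 1, 3, 4)          # group pixels by patch
             .reshape(rows * cols, patch_size * patch_size * c)
    )
    return patches

# A 224x224 RGB endoscopic frame yields 14*14 = 196 tokens of length 768.
frame = np.zeros((224, 224, 3))
tokens = image_to_patches(frame)
print(tokens.shape)  # (196, 768)
```

Because every patch becomes a token, each one can subsequently attend to every other patch in the frame, regardless of spatial distance.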
Our fine-tuned ViT model achieves an accuracy of 93.46%, precision of 93.48%, recall of 93.46%, and F1 score of 93.46%, surpassing InceptionResNetV2, a state-of-the-art CNN-based model, which attains an accuracy of 89.10%, precision of 89.10%, recall of 89.10%, and F1 score of 88.23%. Additionally, the Grad-CAM XAI technique was found to highlight the deterministic features used by the ViT model better than the other XAI methods applied in this work. ViT achieves remarkable performance in classifying the colour misalignment artefact, outperforming CNNs; we attribute this to ViT's enhanced ability to capture pixel relationships through self-attention weights. In addition, the intrinsic self-attention mechanism provides novel insights into the model's decision-making.
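The self-attention weights credited above with capturing pixel relationships are an explicit, inspectable quantity: a row-stochastic matrix in which entry (i, j) measures how much patch i attends to patch j, which is what attention-based XAI visualisations overlay on the frame. A minimal single-head sketch in NumPy (random projections stand in for the learned Wq, Wk, Wv matrices; this is illustrative, not the paper's trained model):

```python
import numpy as np

def self_attention(tokens, seed=0):
    """Single-head scaled dot-product self-attention over patch tokens.

    Returns (output tokens, attention weights). weights[i, j] is how
    strongly patch i attends to patch j; each row sums to 1, so the
    matrix can be read directly as a saliency map over patches.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    # Illustrative random projections in place of learned parameters.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)                       # pairwise patch affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # softmax per row
    return weights @ v, weights

tokens = np.random.default_rng(1).standard_normal((196, 64))
out, attn = self_attention(tokens)
print(attn.shape)  # (196, 196): every patch attends to every other patch
```

Unlike a CNN's local receptive fields, every patch here interacts with all 195 others in a single layer, which is why reading these weights out (or aggregating them across layers) yields the intrinsic interpretability the abstract refers to.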
