Open Access
Transforming Disability into Ability: An Explainable Vision-to-Voice Image Captioning Framework Using Transformer Models and Edge Computing
Author(s) - Ghadah Naif Alwakid, Mamoona Humayun, Zulfiqar Ahmad
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3618646
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Image captioning is an emerging field at the intersection of computer vision and natural language processing (NLP) that has shown great potential to enhance accessibility by automatically generating descriptive text for visual content. For visually impaired individuals, such technology can transform static visual information into meaningful audio narratives, enabling greater independence and participation in daily life. This study introduces an explainable image captioning framework for visually impaired individuals that turns visual scenes into descriptive audio output in real time. Using the Flickr8k dataset, we trained and evaluated two image captioning models. The first uses a Convolutional Neural Network (CNN), specifically ResNet50, as the visual encoder and a Long Short-Term Memory (LSTM) network as the language decoder. It incorporates Grad-CAM to visualize the spatial regions influencing each word prediction and achieves BLEU-1 and BLEU-4 scores of 0.5222 and 0.1069. The second, more sophisticated model keeps the same CNN encoder but replaces the LSTM with a Transformer-based decoder using multi-head self-attention and positional encoding. This attention-based model shows a significant performance gain, reaching BLEU-1 and BLEU-4 scores of 0.6163 and 0.2225, and its word-level self-attention visualizations improve explainability by exposing inter-token relationships during caption generation. We also propose an edge-deployable design for IoT-enabled wearable devices that performs image captioning and speech generation locally, in real time, without relying on cloud services. This combination of accuracy, interpretability, and deployability makes the framework a powerful and transparent assistive tool for supporting visually impaired individuals in real-world settings.
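As context for the attention-based model described in the abstract, the sketch below shows, in minimal PyTorch, how a ResNet50 visual encoder can feed a Transformer decoder with multi-head self-attention and positional encoding. The class name `CaptionTransformer`, the layer sizes, the learned positional encoding, and the 224x224 input / 7x7 feature-grid assumptions are illustrative only, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionTransformer(nn.Module):
    """Illustrative ResNet50-encoder / Transformer-decoder image captioner."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=40):
        super().__init__()
        # Visual encoder: ResNet50 with its classification head removed,
        # yielding a 7x7 grid of 2048-d features for a 224x224 input image.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)                  # project features to model width
        self.embed = nn.Embedding(vocab_size, d_model)        # token embeddings
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positional encoding
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing)
        feats = self.encoder(images)                          # (B, 2048, 7, 7)
        memory = self.proj(feats.flatten(2).transpose(1, 2))  # (B, 49, d_model) image tokens
        tgt = self.embed(captions) + self.pos[:, :captions.size(1)]
        # Causal mask so each word attends only to earlier words.
        T = captions.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf'), device=images.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)     # self- and cross-attention
        return self.out(hidden)                               # (B, T, vocab_size) logits
```

At inference time such a decoder would be run autoregressively from a start token, and the generated caption passed to a text-to-speech engine on the edge device to produce the audio output described in the abstract.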
