Open Access
TDA-ViT: A Transformer-Based Framework for Unified Urdu Text Recognition via Topological and Visual Feature Fusion
Author(s) -
Shahbaz Hassan,
Ahmad Raza Shahid,
Asif Naeem
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3620875
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Urdu text recognition is challenging due to the script's cursive nature, complex ligatures, variable diacritics, and diverse writing styles. Existing methods often target either printed or handwritten text separately and rely only on raw pixel features, which limits robustness. We propose a unified end-to-end framework for Urdu text recognition using a dual-modality input strategy. The system fuses raw grayscale images with topological features derived from Topological Data Analysis (TDA) to capture both pixel-level and structural properties. A two-stream Vision Transformer (Twins-SVT-Large) performs visual encoding, followed by a T5-based encoder and a GPT-2 decoder for auto-regressive sequence generation. To support comprehensive evaluation, we introduce Khat-e-FAST, a novel handwritten Urdu dataset collected from 1,000 native writers, and conduct experiments on the Khat-e-FAST, NUST-UHWR, and UPTI datasets. The proposed framework is analyzed under three configurations — raw-only, topological-only, and dual-modality — to isolate and quantify the contribution of each modality. The dual-modality model achieves the best results, with a Character Error Rate (CER) of 0.05 and a Word Error Rate (WER) of 0.13. These findings confirm that topological features and visual representations are complementary, and that their integration significantly enhances recognition accuracy.
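The reported CER and WER follow the standard definitions used in text-recognition evaluation: Levenshtein edit distance between the predicted and reference strings (at character or word level), normalized by the reference length. A minimal sketch of these metrics, assuming the conventional normalized-edit-distance formulation (the paper's exact evaluation script is not reproduced here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)
```

For example, `cer("kitten", "sitting")` gives 3/6 = 0.5, since three character edits separate the two strings. Because Urdu is cursive, CER is typically the more forgiving metric, which is consistent with the reported CER (0.05) being lower than the WER (0.13).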

