Open Access
Deepfake Audio Detection for Urdu Language Using Deep Neural Networks
Author(s) -
Omair Ahmad,
Muhammad Sohail Khan,
Salman Jan,
Inayat Khan
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3571293
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Audio Deepfakes are highly realistic fake audio recordings produced by AI tools that clone human voices. Advances in Text-to-Speech (TTS) and Voice Conversion (VC) technologies have made it easier to create realistic synthetic and imitative speech, making audio Deepfakes a common and potentially dangerous form of deception. Well-known people, such as politicians and celebrities, are frequent targets: fake recordings put controversial statements in their mouths and cause turmoil on social media, and even children's voices are cloned to scam parents into paying ransoms. Developing effective algorithms to distinguish Deepfake audio from real audio is therefore critical to preventing such fraud. Various machine learning (ML) and deep learning (DL) techniques have been developed to detect audio Deepfakes; however, most of these solutions are trained on datasets in English, Portuguese, French, and Spanish, raising concerns about their accuracy for other languages. The main goal of the research presented in this paper is to evaluate the effectiveness of deep neural networks in detecting audio Deepfakes in the Urdu language. Since no suitable Urdu audio dataset was available for this purpose, we created our own dataset (URFV) containing both genuine and fake audio recordings. The genuine Urdu recordings were gathered from random YouTube podcasts, and their Deepfake counterparts were generated using the RVC model. The dataset has three versions, with clips of 5, 10, and 15 seconds. We built several deep neural networks (RNN+LSTM, CNN+Attention, TCN, CNN+RNN) to detect Deepfake audio produced through imitation or synthetic techniques. The proposed approach extracts Mel-Frequency Cepstral Coefficient (MFCC) features from the audio in the dataset. When tested and evaluated, our models achieved noteworthy accuracy across the datasets. The RNN+LSTM model produced remarkable results: 97.78% (5s), 98.89% (10s), and 98.33% (15s). In comparison, the CNN+RNN hybrid model reached 92.2% (5s), 98.89% (10s), and 93.33% (15s), while the CNN+Attention model recorded 98.89% (5s), 92.22% (10s), and 93.33% (15s). The TCN model was less effective, obtaining 70% (5s), 83.33% (10s), and 77.78% (15s).
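
The pipeline described in the abstract (fixed-length clips, MFCC features, a recurrent binary classifier) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the library choices (librosa, TensorFlow/Keras), the layer sizes, the sample rate, and the MFCC count are assumptions, since the abstract specifies only the MFCC input and the model families.

```python
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

SR = 16000        # assumed sample rate
N_MFCC = 40       # assumed number of MFCC coefficients
DURATION = 5.0    # one of the paper's clip lengths (5s / 10s / 15s)

def extract_mfcc(path, sr=SR, n_mfcc=N_MFCC, duration=DURATION):
    """Load a clip, pad/trim it to a fixed length, and return MFCCs (frames x n_mfcc)."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # time-major, so each row is one frame's feature vector

def build_lstm_classifier(n_frames, n_mfcc=N_MFCC):
    """Stacked-LSTM binary classifier in the spirit of the paper's RNN+LSTM model."""
    model = models.Sequential([
        layers.Input(shape=(n_frames, n_mfcc)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # 1 = Deepfake, 0 = genuine
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage sketch: features shape (n_clips, n_frames, N_MFCC), labels shape (n_clips,)
# feats = np.stack([extract_mfcc(p) for p in clip_paths])
# model = build_lstm_classifier(n_frames=feats.shape[1])
# model.fit(feats, labels, epochs=20, batch_size=32, validation_split=0.2)
```

The same feature extractor would feed the other architectures mentioned (CNN+Attention, TCN, CNN+RNN), with only `build_lstm_classifier` swapped out.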
