Open Access
Deepfake Audio Detection: A Comparative Study of Advanced Deep Learning Models
Author(s) -
Kavya Verma,
Divyanshi Mittal,
Sagnik Samanta,
Kabir Gulati,
Ojas Kulkarni,
Muzaffar Ahmad Dar,
C. L. Biji
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3611839
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
The rapid advancements in artificial intelligence (AI) have significantly enhanced audio synthesis capabilities, enabling deepfake technology to replicate human speech with near-perfect accuracy. This poses severe security threats to domains such as banking, customer service, and law enforcement, as malicious actors can exploit speech synthesis techniques such as voice conversion, replay attacks, and text-to-speech (TTS) to manipulate or impersonate individuals. This paper explores state-of-the-art detection frameworks for deepfake audio, highlighting the effectiveness of advanced deep learning (DL) architectures. Both spectral and temporal features were extracted from the audio recordings. Specifically, the performance of a bidirectional long short-term memory (BLSTM) network, a custom convolutional neural network (CNN), a residual CNN integrated with an attention mechanism and a bidirectional gated recurrent unit (ResCNN-Attention-BGRU), WIREnet (a BLSTM variant), a residual network with fully connected layers (ResNet FC), and a squeeze-and-excitation-enhanced one-dimensional CNN (SE-Enhanced 1D-CNN) was compared. The models achieved notable testing accuracies, with the SE-Enhanced 1D-CNN reaching the highest at 97.64%, followed by ResNet FC (97.46%) and WIREnet (97.20%), demonstrating strong generalization across the evaluated architectures. Furthermore, we experimented with different numbers of mel-frequency cepstral coefficients (MFCCs), specifically 13, 26, and 39, in combination with other spectral and temporal features. The experimental results demonstrated that MFCC-39, combined with spectral and temporal features, provided a robust feature representation and achieved the best performance for deepfake detection.
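As a concrete illustration of the feature-extraction step described above, the following is a minimal sketch, assuming the librosa library for audio loading and MFCC computation. The function name, sample rate, and the particular spectral and temporal features shown (spectral centroid, roll-off, zero-crossing rate) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal feature-extraction sketch (assumed setup, not the paper's exact pipeline).
import numpy as np
import librosa

def extract_features(path, n_mfcc=39, sr=16000):
    """Return one feature vector per recording: MFCCs plus spectral/temporal stats."""
    y, sr = librosa.load(path, sr=sr)

    # Mel-frequency cepstral coefficients (13, 26, or 39, as compared in the study),
    # averaged over time so each clip yields a fixed-length vector.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

    # Example spectral features (assumed; the paper's full feature set is not listed here).
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()

    # Example temporal feature.
    zcr = librosa.feature.zero_crossing_rate(y).mean()

    return np.hstack([mfcc, centroid, rolloff, zcr])
```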

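The best-performing model, the SE-Enhanced 1D-CNN, augments one-dimensional convolutions with squeeze-and-excitation (SE) channel reweighting. Below is a minimal PyTorch sketch of that mechanism; the layer sizes, kernel width, and reduction ratio are illustrative assumptions rather than the architecture reported in the paper.

```python
# Squeeze-and-excitation applied to a 1D convolutional block (illustrative sketch).
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Reweight channels using globally pooled context (squeeze, then excite)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, time)
        w = x.mean(dim=2)                 # squeeze: global average pool over time
        w = self.fc(w).unsqueeze(2)       # excite: per-channel weights in (0, 1)
        return x * w                      # rescale each channel's feature map

class SEConvBlock1d(nn.Module):
    """Conv1d -> BatchNorm -> ReLU -> SE, the basic unit of an SE-enhanced 1D-CNN."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(out_ch)
        self.se = SEBlock1d(out_ch)

    def forward(self, x):
        return self.se(torch.relu(self.bn(self.conv(x))))
```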