Efficient Detection of Targeted Adversarial Attacks in Automatic Speech Recognition Systems
Author(s) -
Daniyal Parveez,
Zesheng Chen,
Jack Li,
Chao Chen
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3613140
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Automatic Speech Recognition (ASR) systems convert spoken words into text and play a critical role in voice assistants, transcription services, and other accessibility tools. However, their reliance on machine learning models makes them susceptible to adversarial attacks. By adding small, often imperceptible perturbations to the input audio, or by convolving it with a carefully designed room impulse response, attackers can cause ASR systems to produce incorrect or even malicious transcripts. In this work, we propose an ensemble detection method for the efficient identification of targeted adversarial attacks in ASR systems. Specifically, we introduce MELo-FEST, which calculates the minimum energy in the low-frequency band of an audio signal to detect convolutive adversarial attacks. We then combine two spectrogram-based detection methods with a noise-adding approach to form an ensemble detector capable of identifying both additive and convolutive targeted adversarial attacks. Through extensive experiments, we demonstrate that our proposed ensemble detector can accurately identify adversarial audio generated by both non-adaptive and adaptive CW, PGD, and AdvReverb attacks in Wav2Vec2 and Whisper ASR systems, achieving an F1 score of at least 0.97. Moreover, our method performs both speech recognition and adversarial detection for each input audio sample in an average of under 0.13 seconds for Wav2Vec2 and 0.29 seconds for Whisper, making it well-suited for real-time applications. Additionally, we find that Whisper is more vulnerable than Wav2Vec2 to both non-adaptive and adaptive targeted adversarial attacks.
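The abstract describes MELo-FEST only as calculating the minimum energy in the low-frequency band of an audio signal. The Python sketch below illustrates one plausible reading of that feature and should not be taken as the authors' implementation. The function name `min_low_band_energy`, the 100 Hz cutoff, the frame length, the hop size, and the Hann window are all assumptions made for illustration; the abstract does not state them, nor does it state the decision threshold or whether adversarial audio scores higher or lower than benign audio.

```python
import numpy as np


def min_low_band_energy(signal, sample_rate, cutoff_hz=100.0,
                        frame_len=1024, hop=512):
    """Return the smallest per-frame spectral energy in [0, cutoff_hz].

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Zero-pad very short inputs so at least one full frame exists.
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))

    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    low_band = freqs <= cutoff_hz  # boolean mask over FFT bins

    energies = []
    for start in range(0, signal.size - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        # Energy of this frame restricted to the low-frequency bins.
        energies.append(np.sum(np.abs(spectrum[low_band]) ** 2))
    return float(min(energies))


# Synthetic example: a 1 kHz tone with and without a 50 Hz component.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
tone_with_hum = tone + 0.5 * np.sin(2 * np.pi * 50 * t)

print(min_low_band_energy(tone, sample_rate))
print(min_low_band_energy(tone_with_hum, sample_rate))
```

In this toy comparison the signal containing the 50 Hz component yields the larger value, because it carries energy below the assumed cutoff in every frame. A detector built on such a feature would compare it against a threshold calibrated on benign and adversarial audio; that calibration is not described in the abstract and is left out here.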