Open Access
A Comparative Study of Deep Audio Models for Spectrogram- and Waveform-based SingFake Detection
Author(s) -
Minh Nguyen-Duc,
Luong Vuong Nguyen,
Huy Nguyen-Ho-Nhat,
Tri-Hai Nguyen,
O-Joun Lee
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3571728
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Recent advancements in singing voice synthesis have significantly improved the quality of artificial singing voices, raising concerns about their potential misuse in generating deepfake singing, or "singfake" voices. Detecting these synthetic voices presents unique challenges due to the complex nature of singing, which involves pitch, timbre, and accompaniment variations. In this study, we conduct a comparative analysis of two model types for singfake detection: (1) models utilizing Log-Mel spectrograms, such as Audio Spectrogram Transformer (AST) and Whisper, and (2) models that process raw waveform inputs, including UniSpeech-SAT and HuBERT. Our experiments on the SingFake dataset evaluate these models under two input conditions—separated vocal tracks and full song mixtures—across different test subsets. The results indicate that spectrogram-based models generally outperform waveform-based models, notably on unseen singers. Metrics such as Precision, Recall, F1-score, Equal Error Rate (EER), and Area Under the Curve (AUC) provide insights into the strengths and weaknesses of each approach. Our findings contribute to developing more effective deepfake singing detection methods, with implications for security, media authentication, and digital content protection.
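The abstract lists Equal Error Rate (EER) and Area Under the Curve (AUC) among its evaluation metrics. As a minimal, dependency-free sketch (not the paper's own evaluation code), both can be computed directly from detector scores, assuming higher scores indicate "fake"; the function names here are illustrative:

```python
def equal_error_rate(fake_scores, real_scores):
    """Approximate EER: sweep score thresholds and return the operating
    point where false-accept rate (FAR) and false-reject rate (FRR) cross."""
    thresholds = sorted(set(fake_scores) | set(real_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in real_scores) / len(real_scores)  # real judged fake
        frr = sum(s < t for s in fake_scores) / len(fake_scores)   # fake judged real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

def auc(fake_scores, real_scores):
    """AUC via the rank statistic: the probability that a random fake
    sample scores higher than a random real one (ties count as 0.5)."""
    wins = sum((f > r) + 0.5 * (f == r) for f in fake_scores for r in real_scores)
    return wins / (len(fake_scores) * len(real_scores))
```

For a perfectly separating detector (all fake scores above all real scores), `equal_error_rate` approaches 0 and `auc` equals 1.0; a chance-level detector sits near EER 0.5 and AUC 0.5.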
