Open Access
Foundation Models for Speech Enhancement Leveraging Consistency Constraints and Contrast Stretching
Author(s) -
Muhammad Salman Khan,
Valerio Mario Salerno,
Moreno La Quatra,
Kuo-Hsuan Hung,
Szu-Wei Fu,
Yu Tsao,
Sabato Marco Siniscalchi
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3619782
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Foundation models (FMs) have proven effective in many speech applications, with the exception of speech enhancement (SE), where FM-based solutions still fall short of specialized deep architectures. This work seeks to close this gap by systematically assessing and contrasting leading pre-trained FM architectures on a commonly used SE task, VoiceBank-DEMAND, and on the more complex Deep Noise Suppression (DNS) challenge. Furthermore, three main ideas are leveraged to boost FM-based SE models: (i) attention-based mask generation, (ii) a consistency-preserving loss, and (iii) perceptual contrast stretching (PCS). Specifically, frame-level representations are modeled using conformer layers, which leverage an attention mechanism. Inconsistencies that arise when reconstructing the signal from the spectrogram are mitigated by incorporating a consistency term in the loss function. Finally, PCS is employed to improve the contrast of input and target features according to perceptual importance. All FM-based models generate an Ideal Ratio Mask (IRM), from which the estimated clean speech is obtained. Experimental results on the VoiceBank-DEMAND task demonstrate that our approach helps close the gap between FM-based and state-of-the-art (SOTA) SE solutions. When tested on the DNS challenge, the proposed FM-based SE solution compares favorably with previously proposed approaches on perceptual quality metrics while using only 10% of the available training material, though with a trade-off in signal fidelity (SI-SDR) when PCS preprocessing is applied.
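
As a rough illustration of the masking step mentioned in the abstract, the sketch below computes a ratio mask and applies it to a noisy spectrogram. It is not the authors' implementation: the abstract does not state the exact IRM definition, so the square-root power-ratio form used here is an assumption (it is one common definition), and the function names `ideal_ratio_mask` and `apply_mask` are hypothetical. In the paper the mask is predicted by the FM-based model; here it is computed from known clean and noise magnitudes, using random arrays as stand-ins for real spectrograms.

```python
import numpy as np


def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Ratio mask in [0, 1] per time-frequency bin.

    Assumed form: sqrt(|S|^2 / (|S|^2 + |N|^2)). The paper may use a variant.
    """
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))


def apply_mask(noisy_spec, mask):
    """Scale the noisy magnitude by the mask and reuse the noisy phase."""
    magnitude = mask * np.abs(noisy_spec)
    phase = np.angle(noisy_spec)
    return magnitude * np.exp(1j * phase)


# Random complex arrays standing in for STFTs (frequency bins x frames).
rng = np.random.default_rng(0)
shape = (257, 100)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

mask = ideal_ratio_mask(np.abs(clean), np.abs(noise))
enhanced = apply_mask(noisy, mask)
```

Because the mask only rescales magnitudes and the noisy phase is reused, the masked spectrogram is not guaranteed to be the STFT of any real time-domain signal, which is the kind of inconsistency the abstract says the consistency term in the loss is meant to reduce.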