
Single‐Channel Speech Separation Based on Non‐negative Matrix Factorization and Factorial Conditional Random Field
Author(s) -
LI Xu,
TU Ming,
WANG Xiaofei,
WU Chao,
FU Qiang,
YAN Yonghong
Publication year - 2018
Publication title -
Chinese Journal of Electronics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.267
H-Index - 25
eISSN - 2075-5597
pISSN - 1022-4653
DOI - 10.1049/cje.2018.06.016
Subject(s) - non negative matrix factorization , conditional random field , matrix decomposition , speech recognition , computer science , channel (broadcasting) , factorial , field (mathematics) , matrix (chemical analysis) , hidden markov model , markov random field , factorization , mathematics , source separation , algorithm , pattern recognition (psychology) , artificial intelligence , telecommunications , mathematical analysis , eigenvalues and eigenvectors , physics , materials science , composite material , quantum mechanics , segmentation , image segmentation , pure mathematics
A new Non‐negative matrix factorization (NMF) based algorithm is proposed for single‐channel speech separation with a priori known speakers, which aims to better model the spectral structure and temporal continuity of speech signals. First, NMF and k‐means clustering are employed to obtain, for each speaker, multiple small dictionaries as well as a state sequence that describes the temporal dynamics between these dictionaries. Then, a Factorial conditional random field (FCRF) model is trained using the state sequences and dictionaries to jointly model the temporal continuity of the two speakers' mixed signal for separation. Experiments show that the proposed algorithm outperforms the baselines on all metrics: sparse NMF (+1.12 dB SDR, +2.37 dB SIR, +0.40 dB SAR, +0.2 MOS), the nonnegative factorial hidden Markov model (+2.04 dB SDR, +4.26 dB SIR, +0.62 dB SAR, +1.0 MOS), and standard NMF (+2.8 dB SDR, +5.08 dB SIR, +1.06 dB SAR, +1.2 MOS).
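The first training stage described in the abstract (small per-state dictionaries plus a state sequence for one speaker) might be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say how NMF and k-means are combined, so the code assumes that k-means labels each normalized spectrogram frame with a state and that a small dictionary is then learned by KL-divergence NMF on the frames of each state. The function names `nmf_kl`, `kmeans`, and `train_speaker`, and all parameter values, are hypothetical. The FCRF stage is not shown.

```python
import numpy as np


def nmf_kl(V, rank, n_iter=100, seed=0, eps=1e-9):
    """Factor non-negative V (F x T) as W @ H with KL multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def train_speaker(V, n_states=3, atoms_per_state=4):
    """Assumed pipeline: cluster frames into states, then NMF per state.

    V is a magnitude spectrogram (F x T) of one speaker's training data.
    Returns a dict {state: W_state (F x atoms_per_state)} and the per-frame
    state sequence (length T).
    """
    frames = V.T / (np.linalg.norm(V.T, axis=1, keepdims=True) + 1e-9)
    _, states = kmeans(frames, n_states)
    dictionaries = {}
    for s in range(n_states):
        Vs = V[:, states == s]
        if Vs.shape[1] == 0:  # skip states that received no frames
            continue
        W, _ = nmf_kl(Vs, atoms_per_state)
        dictionaries[s] = W
    return dictionaries, states


if __name__ == "__main__":
    # Synthetic stand-in for a real magnitude spectrogram.
    V_demo = np.random.default_rng(42).random((32, 120))
    dicts, state_seq = train_speaker(V_demo)
    print({s: W.shape for s, W in dicts.items()}, state_seq[:10])
```

In a full system along the lines the abstract describes, the per-speaker state sequences and dictionaries would then serve as training material for the factorial model that decodes both speakers' states jointly from the mixture.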