
Single‐signal entity approach for sung word recognition with artificial neural network and time–frequency audio features
Author(s) -
Khunarsa Peerapol
Publication year - 2017
Publication title -
The Journal of Engineering
Language(s) - English
Resource type - Journals
ISSN - 2051-3305
DOI - 10.1049/joe.2017.0210
Subject(s) - speech recognition, computer science, artificial neural network, signal (programming language), word (group theory), audio signal, artificial intelligence, pattern recognition (psychology), speech coding, mathematics, programming language, geometry
Abstract -
Singing-voice recognition differs markedly from speech recognition or automatic speech recognition because speaking and singing voices have distinct characteristics. The problem is complex because the background instrumental accompaniment in a music audio signal acts as a noise source that degrades the performance of the recognition system. This study proposes a statistical learning method to recognise words in a vocal audio signal with background music and to classify the singing-voice regions of a polyphonic audio signal. The goal is to recognise words from sung input without applying any source-separation method to remove the instrumental background. The study also borrows a concept from image recognition, treating the spectrogram feature as an image. An audio signal with accompanying music was analysed and transformed into a spectrogram feature, which was then sliced to form feature vectors for a feed-forward neural network classifier. Several classification functions were compared, including K-Nearest Neighbour, Fisher Linear Classifier, Linear Bayes Normal Classifier, Naive Bayes Classifier, Parzen Classifier and Decision Tree. The results show that the feed-forward neural network recognises sung words at an accuracy rate of more than 93.0%. In particular, the system can recognise cross-language music data.
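The pipeline the abstract outlines (spectrogram extraction, slicing into fixed-size feature vectors, feed-forward neural network classification) can be illustrated with a short sketch. This is not the authors' code: the sampling rate, FFT size, slice width, hidden-layer size and file names are all illustrative assumptions, and slicing along the time axis is an assumed reading of the abstract.

```python
# Minimal sketch of the described pipeline, assuming librosa and
# scikit-learn; every numeric parameter here is an illustrative guess.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def spectrogram_slices(path, n_fft=1024, hop=512, slice_frames=16):
    """Compute a log-magnitude spectrogram and cut it into fixed-width
    time slices, each flattened into one feature vector (treating each
    slice like a small image patch)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    spec = librosa.amplitude_to_db(spec, ref=np.max)
    slices = [
        spec[:, start:start + slice_frames].flatten()
        for start in range(0, spec.shape[1] - slice_frames + 1, slice_frames)
    ]
    return np.array(slices)

# Hypothetical labelled clips: one sung word per file.
training_set = [("word_love.wav", 0), ("word_rain.wav", 1)]
X_parts, y_parts = [], []
for path, label in training_set:
    feats = spectrogram_slices(path)
    X_parts.append(feats)
    y_parts.append(np.full(len(feats), label))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# Feed-forward neural network; the layer size is an assumption,
# not the paper's architecture.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(X, y)

# Classify the slices of an unseen clip and take a majority vote.
test_slices = spectrogram_slices("unknown_word.wav")
pred = np.bincount(clf.predict(test_slices)).argmax()
print("predicted word id:", pred)
```

Swapping MLPClassifier for KNeighborsClassifier, GaussianNB or DecisionTreeClassifier from the same library would reproduce the kind of classifier comparison the abstract reports.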