New hybrid data mining model for prediction of Salmonella presence in agricultural waters based on ensemble feature selection and machine learning algorithms
Author(s) - Buyrukoğlu Selim
Publication year - 2021
Publication title - Journal of Food Safety
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.427
H-Index - 43
eISSN - 1745-4565
pISSN - 0149-6085
DOI - 10.1111/jfs.12903
Subject(s) - random forest, support vector machine, ensemble learning, feature selection, cluster analysis, computer science, ensemble forecasting, artificial intelligence, naive bayes classifier, artificial neural network, machine learning, data set, data mining, feature (linguistics), algorithm, philosophy, linguistics
This paper presents a new hybrid ensemble data mining model for predicting Salmonella presence in agricultural surface waters. The model combines a heterogeneous ensemble approach to feature selection with clustering, regression, and classification algorithms. The data set was collected from six agricultural ponds in Central Florida and consists of 540 instances (26 Salmonella positive and 514 Salmonella negative) described by 23 features. The model has three stages. First, a heterogeneous ensemble feature selection (HEFS) approach was applied to select the top features. Second, the k‐means clustering algorithm was used to remove misclassified cases from the data set. Third, classification and regression algorithms, namely support vector machine (SVM), Naïve Bayes (NB), Artificial Neural Network (ANN), and Random Forest (RF), were applied to the preprocessed data set to predict Salmonella presence, with 20% of the data held out as the test set. These algorithms were also combined into 10 different ensemble models through soft voting, and the performance of these hybrid ensemble models was evaluated alongside the single models. The ANN + RF ensemble achieved the highest performance, outperforming all other single and ensemble models in both Area under the ROC Curve (AUC, 0.98) and prediction accuracy (94.9%). The findings support the validity of the hybrid ensemble model and encourage its use by researchers predicting Salmonella presence in agricultural surface waters.
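The soft voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common definition of soft voting (averaging the class probabilities predicted by each model and choosing the class with the highest average); the abstract does not give the paper's exact implementation. The function names `soft_vote` and `predict_labels` and all probability values are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average class-probability vectors across models, sample by sample.

    prob_sets[m][i][c] is model m's probability that sample i has class c.
    """
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    averaged = []
    for i in range(n_samples):
        n_classes = len(prob_sets[0][i])
        averaged.append([
            sum(model[i][c] for model in prob_sets) / n_models
            for c in range(n_classes)
        ])
    return averaged


def predict_labels(avg_probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick the class index with the highest averaged probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in avg_probs]


# Hypothetical outputs [P(negative), P(positive)] for three water samples.
# These numbers are made up; they are not results from the paper.
ann_probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
rf_probs = [[0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]

avg = soft_vote([ann_probs, rf_probs])
labels = predict_labels(avg)
print(labels)
```

In the third sample the two models disagree, and the more confident one (the hypothetical RF output) decides the result. That is the practical difference between soft voting and a plain majority vote over predicted labels.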