
Spam text classification using LSTM Recurrent Neural Network
Publication year - 2021
Publication title - International Journal of Emerging Trends in Engineering Research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.218
H-Index - 14
ISSN - 2347-3983
DOI - 10.30534/ijeter/2021/11992021
Subject(s) - artificial intelligence, computer science, recurrent neural network, support vector machine, naive bayes classifier, machine learning, class (philosophy), convolutional neural network, f1 score, false positive paradox, set (abstract data type), artificial neural network, pattern recognition (psychology), natural language processing, programming language
Sequence classification is one of the in-demand research areas in the field of Natural Language Processing (NLP). Classifying a set of images or texts into an appropriate category or class is a complex task that many Machine Learning (ML) models fail to accomplish accurately, often under-fitting the given dataset. Algorithms used for text classification include KNN, Naïve Bayes, Support Vector Machines, Convolutional Neural Networks (CNNs), Recursive CNNs, Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks, among others. For this experimental study, LSTM and a few other algorithms were chosen for a comparative study. The dataset used is the SMS Spam Collection Dataset from Kaggle, augmented with 150 additional entries gathered from other sources. The two possible class labels for the data points are spam and ham. Each entry consists of the class label and a few sentences of text, followed by a few extraneous fields that are discarded. After the text is converted to the required format, the models are trained and then evaluated using various metrics. In the experiments, the LSTM achieves much better classification accuracy than the other machine learning models, reaching F1-scores in the high nineties. The other models showed much lower F1-scores and cosine similarities, indicating that they underperformed on the dataset. Another notable observation is that the LSTM produced fewer false positives and false negatives than any other model.
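A minimal sketch of the pipeline the abstract describes: tokenize the SMS messages, pad them to fixed-length sequences, train an Embedding → LSTM → sigmoid classifier, and report the F1-score. The file name "spam.csv", the column layout (v1 = label, v2 = text, trailing unused columns), and all hyperparameters (vocabulary size 10,000, sequence length 100, 64 LSTM units, 5 epochs) are illustrative assumptions, not values confirmed by the paper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

# Assumed Kaggle file layout: v1 = class label, v2 = message text;
# the remaining unnamed columns are the extraneous fields the paper discards.
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]
y = (df["label"] == "spam").astype(int).values  # ham -> 0, spam -> 1

# Convert raw text into fixed-length integer sequences.
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(df["text"])
X = pad_sequences(tokenizer.texts_to_sequences(df["text"]), maxlen=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Embedding -> LSTM -> sigmoid output for the binary spam/ham decision.
model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    LSTM(64),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.1)

# F1-score, the headline metric in the abstract.
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print("F1-score:", f1_score(y_test, y_pred))
```

The sigmoid output with binary cross-entropy loss matches the two-class (spam/ham) setup; a threshold of 0.5 converts the predicted probabilities into hard labels before computing the F1-score.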