A spark‐based big data analysis framework for real‐time sentiment prediction on streaming data
Author(s) -
Kılınç Deniz
Publication year - 2019
Publication title -
Software: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.437
H-Index - 70
eISSN - 1097-024X
pISSN - 0038-0644
DOI - 10.1002/spe.2724
Subject(s) - sentiment analysis, computer science, big data, spark (programming language), naive bayes classifier, context (archaeology), service (business), dashboard, data mining, data science, machine learning, artificial intelligence, information retrieval, support vector machine, paleontology, economy, biology, economics, programming language
Summary Many data sources produce large volumes of data, and the nature of Big Data calls for new distributed processing approaches to extract valuable information. Real‐time sentiment analysis is one of the most demanding research areas and requires powerful Big Data analytics tools such as Spark. Prior literature surveys have shown that, although there are many conventional sentiment analysis studies, only a few perform sentiment analysis in real time. One major factor affecting the quality of real‐time sentiment analysis is the trustworthiness of the generated data. More precisely, it is a valuable research question to determine whether the account owner who generates a sentiment is genuine. Since data generated by fake accounts may decrease the accuracy of the outcome, an intelligent service that can assess the source of the data is a key part of the analysis. In this context, we add a fake account detection service to the proposed framework. Both the sentiment analysis and the fake account detection systems are trained and tested using the Naïve Bayes model from Apache Spark's machine learning library. The developed system consists of four integrated software components: (i) a machine learning and streaming service for sentiment prediction, (ii) a Twitter streaming service to retrieve tweets, (iii) a Twitter fake account detection service to assess the owner of the retrieved tweet, and (iv) a real‐time reporting and dashboard component to visualize the results of sentiment analysis. The sentiment classification performance of the system in offline and real‐time modes is 86.77% and 80.93%, respectively.
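The flow described in the summary, where a tweet is scored for sentiment only after its account passes the fake account check, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than Spark's machine learning library, and the class `TinyNaiveBayes`, the function `classify_stream`, the profile feature tokens, and the sample tweets are all hypothetical names and data chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative, not Spark MLlib)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_docs):
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_docs[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify_stream(tweets, sentiment_model, account_model):
    """Yield (text, sentiment) only for tweets whose account is predicted genuine."""
    for tweet in tweets:
        if account_model.predict(tweet["profile"]) == "fake":
            continue  # drop tweets from accounts flagged as fake
        yield tweet["text"], sentiment_model.predict(tweet["text"])


# Hypothetical toy training data; the paper's real features and corpora are not given here.
sentiment = TinyNaiveBayes().fit(
    ["great phone love it", "terrible battery hate it", "love the screen", "hate the service"],
    ["positive", "negative", "positive", "negative"],
)
accounts = TinyNaiveBayes().fit(
    ["no_photo few_followers new_account", "has_photo many_followers old_account"],
    ["fake", "genuine"],
)

stream = [
    {"text": "love this great screen", "profile": "has_photo many_followers old_account"},
    {"text": "hate it", "profile": "no_photo few_followers new_account"},
    {"text": "terrible service", "profile": "has_photo many_followers old_account"},
]
results = list(classify_stream(stream, sentiment, accounts))
print(results)
```

In a Spark deployment, the two toy classifiers would presumably be replaced by trained Naïve Bayes models from Spark's machine learning library, and the Python list by a live tweet stream; the filtering order shown here is one plausible reading of how components (i) to (iii) interact.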