
Statistical Analysis for Twitter Spam Detection
Author(s) -
Ganesh Udge,
Mahesh Mohite,
Shubhankar Bendre,
Yogeshwar Birnagal,
Disha Sushant Wankhede
Publication year - 2019
Publication title -
international journal of scientific research in science, engineering and technology
Language(s) - English
Resource type - Journals
eISSN - 2395-1990
pISSN - 2394-4099
DOI - 10.32628/ijsrset1962170
Subject(s) - computer science , naive bayes classifier , bag of words model , support vector machine , categorization , classifier (uml) , hash function , machine learning , information retrieval , artificial intelligence , data mining , computer security
The spreading and learning of new discoveries and information is made available using current online social networks. In Recent days, the solutions may be irrelevant to the actual content; also termed as attacks in the layman’s term such attacks are been performed on Twitter as well and called as Twitter spammers. The quality of data is being compromised by addition of malicious and harmful information using URL, bio, emoticons, audio, images/videos & hash-tags through different accounts by exchanging tweets, personal messages (Direct Message’s) & re-tweets. Misleading sites may be linked with the malicious links which may affect adverse effects on the user and also interfere in their decision making processes. To improve user-experience from the spammers attacks, the training twitter dataset are applied and then by extracting and using the 12 lightweight features like user’s age, number of followers, count of tweets and re-tweets, etc. are used to distinguish the spam from non-spam. For enhancing the performance, the discretization of the function is important for transmission of spam detection between tweets. Our system creates classification model for Spam detection which includes binary classification and automatic learning algorithms viz. Naïve Bayes classifier or Support Vector Machine classifier which understands the behaviour of the model. The system will categorize the tweets from datasets into Spam and Non-spam classes and provide the user’s feed with only the relevant information. The system will report the impact of data-related factors such as relationship between spam and non-spam tweets, size of training dataset, data sampling and detection performance. The proposed system’s function is detection and analysis of the simple and variable twitter spam over time. The spam detection is a major challenge for the system and shortens the gap between performance appraisals and focuses primarily on data, features and patterns to identify real user and informing it about the spam tweets along with the performance statistics. The work is to detect spammed tweets in real time, since the new tweets may show patterns and this will help for training and updating dataset and in knowledge base.