TOWARDS A CLOUD-BASED APPROACH FOR SPAM URL DEDUPLICATION FOR BIG DATASETS
Author(s) -
Shams Zawoad,
Ragib Hasan,
Gary Warner,
Md Munirul Haque
Publication year - 2014
Publication title -
services transactions on cloud computing
Language(s) - English
Resource type - Journals
eISSN - 2326-7550
pISSN - 2326-7542
DOI - 10.29268/stcc.2014.0008
Subject(s) - data deduplication , computer science , cloud computing , big data , database , data science , data mining , information retrieval , world wide web , operating system
Spam emails are often used to advertise phishing websites and lure users to visit such sites. URL blacklisting is a widely used technique for blocking malicious phishing websites. To prepare an effective blacklist, it is necessary to analyze possible threats and include the identified malicious sites in the blacklist. However, the number of URLs acquired from spam emails is quite large. Fetching and analyzing the content of this large number of websites are very expensive tasks given limited computing and storage resources. To solve the problem of massive computing and storage resource requirements, we need a highly distributed and scalable architecture, where we can provision additional resources to fetch and analyze on the fly. Moreover, there is a high degree of redundancy in the URLs extracted from spam emails, where more than one spam emails contain the same URL. Hence, preserving the contents of all the websites causes significant storage waste. Additionally, fetching content from a fixed IP address introduces the possibility of being reversed blacklisted by malicious websites. In this paper, we propose and develop CURLA – a Cloud-based spam URL Analyzer, built on top of Amazon Elastic Computer Cloud (EC2) and Amazon Simple Queue Service (SQS). CURLA allows deduplicating large number of spam-based URLs in parallel, which reduces the cost of establishing equally capable local infrastructure. Our system builds a database of unique spam-based URL and accumulates the content of these unique websites in a central repository. This database and website repository will be a great resource to identify phishing websites and other counterfeit websites. We show the effectiveness of our architecture using real-life, large-scale spam-based URL data.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom