Open Access
An improved Simhash algorithm based malicious mirror website detection method
Author(s) - Guangxuan Chen, Guangxiao Chen, Di Wu, Qiang Liu, Lei Zhang, Xiaoshi Fan
Publication year - 2021
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1971/1/012067
Subject(s) - computer science, web page, data deduplication, similarity (geometry), phishing, information retrieval, the internet, data mining, algorithm, world wide web, database, image (mathematics), artificial intelligence
The Internet contains a large number of similar or even identical webpages. These webpages waste network resources: they consume storage space, slow down web search, and degrade the user experience. Some malicious mirror websites also serve as tools for criminals to carry out illegal activities such as phishing attacks. In this paper, the authors analyze the mainstream text similarity detection algorithms and webpage deduplication algorithms, and propose an improved webpage deduplication algorithm based on Simhash. The algorithm maps each text collection to a Simhash fingerprint for storage, then computes the Hamming distance between two fingerprints to obtain the similarity of the corresponding webpages. Experiments show that the proposed algorithm achieves higher accuracy and recall, and is better suited to identifying and detecting malicious mirror websites.
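
The fingerprint-and-compare step described in the abstract can be illustrated with a minimal sketch of the standard Simhash scheme. This is not the authors' improved algorithm, whose details the abstract does not give. The function names, the 64-bit fingerprint width, the use of MD5 as the per-token hash, and term frequency as the token weight are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Compute a standard Simhash fingerprint (assumed: word tokens, TF weights)."""
    # Tokenize into lowercase words and weight each token by its frequency.
    weights = Counter(re.findall(r"\w+", text.lower()))
    vector = [0] * bits
    for token, weight in weights.items():
        # Hash each token to a `bits`-wide integer (MD5 is an illustrative choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Add the weight where the hash bit is 1, subtract it where the bit is 0.
        for i in range(bits):
            if (token_hash >> i) & 1:
                vector[i] += weight
            else:
                vector[i] -= weight
    # Positive components become 1 bits in the final fingerprint.
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Map Hamming distance to a similarity score in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / bits


if __name__ == "__main__":
    page_a = "Welcome to Example Bank. Sign in to view your account balance."
    page_b = "Welcome to Example Bank. Sign in to see your account balance."
    fp_a, fp_b = simhash(page_a), simhash(page_b)
    print(hamming_distance(fp_a, fp_b), similarity(fp_a, fp_b))
```

In a deduplication setting, two pages would typically be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The abstract does not state the threshold used, so any value applied to this sketch would have to be tuned on real data.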
