A Code Classification Method Based on TF-IDF
Author(s) - Ke Wang, JianHong Jiang, Rui-Yun MA
Publication year - 2018
Publication title - DEStech Transactions on Economics, Business and Management
Language(s) - English
Resource type - Journals
ISSN - 2475-8868
DOI - 10.12783/dtem/eced2018/23926
Subject(s) - cosine similarity, computer science, cluster analysis, code (set theory), similarity (geometry), tf–idf, set (abstract data type), data mining, document clustering, pattern recognition (psychology), cluster (spacecraft), information retrieval, feature (linguistics), artificial intelligence, programming language, linguistics, philosophy, physics, quantum mechanics, term (time), image (mathematics)
The main purpose of this study is to find code that is likely to be duplicated, so that the adverse effects of code duplication can be avoided. Document feature information is first preprocessed for clustering in order to extract the relevant features of each document. These basic features are then used to cluster the documents and to determine the best number of clusters. With that number of clusters, the vectors generated by the TF-IDF method are combined with the K-means clustering algorithm to distinguish the contents of the files, and cosine similarity is introduced to determine the similarity of two texts and to classify the parallel documents. On the test data set, the method accurately finds code with the possibility of duplication and works quite well.
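The pipeline described in the abstract (TF-IDF vectors, K-means clustering, cosine similarity between two texts) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `tokenize` helper, the smoothed IDF formula, the L2 normalization, and the farthest-first centroid initialization are all choices made here for the example, and the number of clusters `k` is simply passed in rather than selected automatically.

```python
import math
import re
from collections import Counter

def tokenize(source):
    # Hypothetical tokenizer: lowercase identifiers and keywords only.
    return re.findall(r"[A-Za-z_]\w*", source.lower())

def tfidf_vectors(docs):
    # Build one L2-normalized TF-IDF vector per document.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF: an assumption, the paper's exact weighting is not given.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vec = [counts[t] / total * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def cosine(u, v):
    # Cosine similarity of two equal-length vectors (0.0 if either is zero).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sq_dist(u, v):
    # Squared Euclidean distance, used by K-means below.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    # Deterministic farthest-first initialization (a simplification).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

docs = [
    "def add(a, b): return a + b",
    "def add(x, y): return x + y",
    "for item in items: print(item)",
    "for item in items: print(item, item)",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
print(round(cosine(vectors[2], vectors[3]), 3))
```

In this toy input the two `add` functions fall into one cluster and the two loops into the other. Within a cluster, pairs whose cosine similarity exceeds some threshold would be the candidates for duplication; any such threshold is a tuning choice and is not taken from the paper.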