Big data clone detection using classical detectors: an exploratory study
Author(s) - Jeffrey Svajlenko, Iman Keivanloo, Chanchal K. Roy
Publication year - 2015
Publication title - Journal of Software: Evolution and Process
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.371
H-Index - 29
eISSN - 2047-7481
pISSN - 2047-7473
DOI - 10.1002/smr.1662
Subject(s) - computer science, scalability, big data, source code, code (set theory), clone (java method), data mining, data science, database, programming language, dna, genetics, set (abstract data type), biology
Big data analysis is an emerging research topic in various domains, and clone detection is no exception. The goal is to create big data inter‐project clone corpora across open‐source or corporate‐source code repositories. Such corpora can be used to study developer behavior, to reduce engineering costs by extracting globally duplicated efforts into new APIs, and as a basis for code completion and API usage support. However, building scalable clone detection tools is challenging. It is often impractical to use existing state‐of‐the‐art tools to analyze big data because the memory and execution time required exceed the average user's resources. Some tools have inherent limitations in their data structures and algorithms that prevent the analysis of big data even when extraordinary resources are available. These limitations are impossible to overcome if the source code of the tool is unavailable or if the user lacks the time or expertise to modify the tool without harming its performance or accuracy. In this research, we investigated the use of our shuffling framework for scaling classical clone detection tools to big data. The framework achieves scalability on commodity hardware by partitioning the input dataset into subsets manageable by the tool and the available computing resources. A non‐deterministic process randomly ‘shuffles’ the contents of the dataset into a series of subsets. The tool is executed on each subset, and the per‐subset outputs are merged into a single report. This approach does not require modification of the subject tools, allowing their individual strengths and precision to be captured at an acceptable loss of recall. In our study, we explored the performance and applicability of the framework on the big data dataset IJaDataset 2.0, which consists of 356 million lines of code from 25,000 open‐source Java projects. We began with a computationally inexpensive version of our framework based on pure random shuffling. This version successfully scaled the tools to IJaDataset but required many subsets to achieve a desirable recall. Using our findings, we incrementally improved the framework to achieve a satisfactory recall using fewer resources. We investigated the use of efficient file tracking and file‐similarity heuristics to bias the shuffling algorithm toward subsets of the dataset that contain undetected clone pairs. These changes improved the recall performance of the framework. Our study shows that the framework is able to achieve up to 90–95% of a tool's native recall using standard hardware. Copyright © 2014 John Wiley & Sons, Ltd.
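
The abstract describes the framework's core loop: randomly shuffle the dataset into tool-sized subsets, run the unmodified detector on each subset, and merge the per-subset clone reports, repeating over several rounds to recover recall. The Java sketch below illustrates only that pure random shuffling loop under stated assumptions; it is not the authors' implementation, and names such as ShufflingFrameworkSketch, runDetector, and ClonePair are hypothetical placeholders. The similarity-biased shuffling and file tracking mentioned in the abstract would replace the plain Collections.shuffle step with a heuristic that groups files likely to contain undetected clone pairs.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Minimal sketch of the pure random shuffling variant described in the abstract,
// assuming a hypothetical runDetector(...) wrapper around an unmodified clone detector.
public class ShufflingFrameworkSketch {

    // Hypothetical record for an unordered clone pair of file identifiers.
    record ClonePair(String a, String b) {
        ClonePair {
            // Canonical ordering so (A, B) and (B, A) deduplicate in a HashSet.
            if (a.compareTo(b) > 0) { String tmp = a; a = b; b = tmp; }
        }
    }

    // Placeholder for running the subject tool on one subset and parsing its report.
    static Set<ClonePair> runDetector(List<String> subsetFiles) {
        return new HashSet<>(); // a real implementation would invoke the tool here
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>(
                List.of("A.java", "B.java", "C.java", "D.java", "E.java", "F.java"));
        int subsetSize = 3; // sized so each subset fits the tool's memory and time budget
        int rounds = 4;     // more shuffling rounds recover more cross-subset clone pairs (higher recall)

        Set<ClonePair> mergedReport = new HashSet<>();
        Random rng = new Random();
        for (int round = 0; round < rounds; round++) {
            Collections.shuffle(files, rng); // non-deterministic shuffle of the dataset
            for (int start = 0; start < files.size(); start += subsetSize) {
                List<String> subset = files.subList(start, Math.min(start + subsetSize, files.size()));
                mergedReport.addAll(runDetector(subset)); // merge the subset's clone report
            }
        }
        System.out.println("Merged clone pairs detected: " + mergedReport.size());
    }
}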