
INTEGRATED EFFECT OF DATA CLEANING AND SAMPLING ON DECISION TREE LEARNING OF LARGE DATA SETS
Author(s) - Dipak V. Patil, Rajankumar S. Bichkar
Publication year - 2014
Publication title - International Journal of Computing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.184
H-Index - 11
eISSN - 2312-5381
pISSN - 1727-6209
DOI - 10.47839/ijc.11.3.565
Subject(s) - computer science, data mining, decision tree, decision tree learning, outlier, data quality, data cleansing, data set, incremental decision tree, anomaly detection, sampling (signal processing), speedup, set (abstract data type), classifier (uml), machine learning, filter (signal processing), artificial intelligence, metric (unit), operations management, economics, computer vision, programming language, operating system
Advances in and widespread use of technology in all walks of life have resulted in tremendous growth of the data available for data mining. The large amount of knowledge this data holds can be utilized to improve decision-making processes. However, the data contains some noise or outliers, which hampers the classification performance of a classifier built on such training data. Learning on large data sets is also very slow, as it has to be done serially on the available data. It has been shown that random data reduction techniques can be used to build optimal decision trees. Thus, data cleaning and data sampling techniques can be integrated to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality, and a random sampling technique is then applied to this clean data set to obtain a reduced data set. The reduced data set is used to construct an optimal decision tree. Experiments performed on several data sets show that the proposed technique builds decision trees with enhanced classification accuracy compared to the classification performance on the complete data sets. The classification filter improves data quality, and sampling reduces the size of the data set. Thus, the proposed method constructs more accurate, optimally sized decision trees and avoids problems such as overloading memory and the processor with large data sets. In addition, the time required to build a model on clean data is significantly reduced, providing substantial speedup.
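Below is a minimal sketch of the clean-then-sample pipeline the abstract describes, written in Python with scikit-learn and NumPy. The choice of a decision tree as the filtering classifier, the 5-fold cross-validation, the 30% sample fraction, and the digits data set are illustrative assumptions for this sketch, not details taken from the paper.

    # Minimal sketch of the clean-then-sample pipeline (illustrative
    # parameters; the paper's actual filter and sample sizes may differ).
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_predict, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def classification_filter(X, y, n_folds=5):
        """Flag instances misclassified under cross-validation as noise/outliers."""
        preds = cross_val_predict(DecisionTreeClassifier(random_state=0),
                                  X, y, cv=n_folds)
        keep = preds == y  # keep only instances the filter classifies correctly
        return X[keep], y[keep]

    def clean_and_sample(X, y, sample_fraction=0.3, seed=0):
        """Step 1: filter out noisy instances; step 2: draw a random subsample."""
        X_clean, y_clean = classification_filter(X, y)
        rng = np.random.default_rng(seed)
        n = int(sample_fraction * len(y_clean))
        idx = rng.choice(len(y_clean), size=n, replace=False)
        return X_clean[idx], y_clean[idx]

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Build the tree on the cleaned, reduced training set and evaluate it.
    X_red, y_red = clean_and_sample(X_tr, y_tr)
    tree = DecisionTreeClassifier(random_state=0).fit(X_red, y_red)
    print(f"reduced training set: {len(y_red)} of {len(y_tr)} instances, "
          f"test accuracy: {tree.score(X_te, y_te):.3f}")

Removing instances that a cross-validated classifier mislabels is one common realization of a classification filter; the subsequent random subsample is what keeps the final tree small and fast to build, per the approach outlined above.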