
INTEGRATED EFFECT OF DATA CLEANING AND SAMPLING ON DECISION TREE LEARNING OF LARGE DATA SETS
Author(s) - Dipak V. Patil, Rajankumar S. Bichkar
Publication year - 2014
Publication title - International Journal of Computing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.184
H-Index - 11
eISSN - 2312-5381
pISSN - 1727-6209
DOI - 10.47839/ijc.11.3.565
Subject(s) - computer science, data mining, decision tree, decision tree learning, outlier, data quality, data cleansing, data set, incremental decision tree, anomaly detection, sampling (signal processing), speedup, set (abstract data type), classifier (uml), machine learning, filter (signal processing), artificial intelligence, metric (unit), operations management, economics, computer vision, programming language, operating system
Advances in and widespread use of technology in all walks of life have resulted in tremendous growth of the data available for data mining. The large amount of knowledge this data holds can be utilized to improve decision-making processes. However, the data contains some noise or outliers, which hampers the classification performance of a classifier built on such training data. Learning on large data sets is also very slow, as it has to be done serially on the available data. It has been shown that random data reduction techniques can be used to build optimal decision trees. Thus, data cleaning and data sampling techniques can be integrated to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality, and a random sampling technique is then applied to this clean data set to obtain a reduced data set. The reduced data set is used to construct an optimal decision tree. Experiments performed on several data sets show that the proposed technique builds decision trees with enhanced classification accuracy compared to the classification performance on the complete data sets. The classification filter improves data quality, and sampling reduces the size of the data set. Thus, the proposed method constructs more accurate, optimally sized decision trees and avoids problems such as overloading memory and the processor with large data sets. In addition, the time required to build a model on clean data is significantly reduced, providing substantial speedup.
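Below is a minimal sketch of the clean-then-sample pipeline the abstract describes, written in Python with scikit-learn and NumPy. The choice of a decision tree as the filtering classifier, the 5-fold cross-validation, the 30% sample fraction, and the digits data set are illustrative assumptions for this sketch, not details taken from the paper.

    # Minimal sketch of the clean-then-sample pipeline (illustrative
    # parameters; the paper's actual filter and sample sizes may differ).
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_predict, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def classification_filter(X, y, n_folds=5):
        """Flag instances misclassified under cross-validation as noise/outliers."""
        preds = cross_val_predict(DecisionTreeClassifier(random_state=0),
                                  X, y, cv=n_folds)
        keep = preds == y  # keep only instances the filter classifies correctly
        return X[keep], y[keep]

    def clean_and_sample(X, y, sample_fraction=0.3, seed=0):
        """Step 1: filter out noisy instances; step 2: draw a random subsample."""
        X_clean, y_clean = classification_filter(X, y)
        rng = np.random.default_rng(seed)
        n = int(sample_fraction * len(y_clean))
        idx = rng.choice(len(y_clean), size=n, replace=False)
        return X_clean[idx], y_clean[idx]

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Build the tree on the cleaned, reduced training set and evaluate it.
    X_red, y_red = clean_and_sample(X_tr, y_tr)
    tree = DecisionTreeClassifier(random_state=0).fit(X_red, y_red)
    print(f"reduced training set: {len(y_red)} of {len(y_tr)} instances, "
          f"test accuracy: {tree.score(X_te, y_te):.3f}")

Removing instances that a cross-validated classifier mislabels is one common realization of a classification filter; the subsequent random subsample is what keeps the final tree small and fast to build, per the approach outlined above.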