Open Access
Content-based prediction: big data sampling perspective
Author(s) -
Waleed Albattah,
Saleh Albahli
Publication year - 2019
Publication title -
International Journal of Engineering and Technology
Language(s) - English
Resource type - Journals
ISSN - 2227-524X
DOI - 10.14419/ijet.v8i4.30150
Subject(s) - terabyte , petabyte , computer science , sampling (signal processing) , machine learning , big data , support vector machine , artificial intelligence , data mining , process (computing) , random forest , perspective (graphical) , filter (signal processing) , computer vision , operating system
Today, large volumes of data are actively generated on the order of terabytes or even petabytes, and processing data at such a scale efficiently and effectively is extremely challenging. Yet most existing studies apply machine learning algorithms by loading the entire training dataset into the computer's main memory (RAM). This becomes a problem as the data grows over time and can no longer fit within a single machine's memory under conventional models or hardware. Inspired by current research, this paper discusses the benefits of two sampling techniques for machine learning models: (1) sampling with replacement and (2) reservoir sampling. In this study, 40 experiments were performed in which random sampling reduced the number of data instances to 50% of the original video dataset, which was more than 40 GB in size. The results show that SVM and random forest are very competitive classifiers; accuracy and importance scores are reported for all ten repeated rounds of the process for each of the four combinations of sampling technique and machine learning classifier.
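The two sampling techniques the abstract names can be sketched as follows. This is a minimal illustration under common definitions (uniform sampling with replacement, and Vitter's Algorithm R for reservoir sampling), not the authors' actual implementation; the function names are hypothetical:

```python
import random

def sample_with_replacement(data, k, seed=None):
    # Draw k items uniformly and independently; the same item may
    # appear more than once. Requires random access to the dataset.
    rng = random.Random(seed)
    return [data[rng.randrange(len(data))] for _ in range(k)]

def reservoir_sample(stream, k, seed=None):
    # Algorithm R: maintain a k-item reservoir while scanning the
    # stream once, so the full dataset never has to fit in memory --
    # the property that matters for terabyte-scale inputs.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k/(i+1),
            # which keeps every item's inclusion probability uniform.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Reducing a dataset to 50% of its instances, as in the experiments described above, would correspond to calling either function with k equal to half the number of instances; reservoir sampling is the natural fit when the 40+ GB dataset is read as a stream rather than loaded into RAM.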
