Open Access
Efficient analytics on ordered datasets using MapReduce
Author(s) - Jiangtao Yin, Yong Liao, Mario Baldi, Lixin Gao, Antonio Nucci
Publication year - 2013
Publication title - CiteSeerX (The Pennsylvania State University)
Language(s) - English
Resource type - Conference proceedings
ISBN - 978-1-4503-1910-2
DOI - 10.1145/2462902.2462930
Subject(s) - computer science, merge (version control), speedup, analytics, big data, workload, business intelligence, sort, execution time, database transaction, data mining, database, parallel computing, operating system
Efficiently analyzing data on a large scale can be vital for data owners seeking useful business intelligence. Event log files are among the most common datasets used for this purpose. Time-sorted records in event log files often need to be grouped by user ID or transaction ID, while preserving their time order, in order to mine user behaviors such as click-through rate. This kind of analytical workload is referred to here as RElative Order-pReserving based Grouping (Re-Org). Using MapReduce/Hadoop, a popular big data analysis tool, as is to execute Re-Org tasks on ordered datasets is inefficient because of its internal sort-merge mechanism. We propose a framework that adopts an efficient group-order-merge mechanism to execute Re-Org tasks faster, and we implement it by extending Hadoop. Experimental results show a 2.2x speedup over executing Re-Org tasks in plain vanilla Hadoop.
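To make the Re-Org workload concrete, the following is a minimal single-machine sketch in Python. It is not the authors' Hadoop extension. The function names, the sample log, and the assumption that input splits are consecutive chunks of the time-sorted log are all illustrative. It shows only that when the input is already time ordered, grouping by key and concatenating per-split groups in split order keeps each group's relative order without any sorting step.

```python
def group_preserving_order(records, key):
    """Group records by key in one pass; each group keeps input order."""
    groups = {}
    for record in records:
        groups.setdefault(key(record), []).append(record)
    return groups


def merge_partial_groups(partials):
    """Concatenate per-split groups in split order.

    Assumption (not from the paper): the splits are consecutive chunks
    of a time-sorted log, so appending in split order preserves each
    group's time order without re-sorting.
    """
    merged = {}
    for partial in partials:
        for k, recs in partial.items():
            merged.setdefault(k, []).extend(recs)
    return merged


def user_of(record):
    """Extract the grouping key (user ID) from a record."""
    return record[1]


# Hypothetical time-sorted event log: (timestamp, user_id, action)
log = [
    (1, "u1", "view"),
    (2, "u2", "view"),
    (3, "u1", "click"),
    (4, "u2", "view"),
    (5, "u1", "view"),
]

# Group the whole log at once, and also as two consecutive splits.
whole = group_preserving_order(log, user_of)
split_a = group_preserving_order(log[:3], user_of)
split_b = group_preserving_order(log[3:], user_of)
merged = merge_partial_groups([split_a, split_b])
```

Here `whole` and `merged` hold the same groups, and the records inside each group remain in timestamp order. A sort-based shuffle would instead have to re-establish that order by comparing records.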