HaDaap: A hotness‐aware data placement strategy for improving storage efficiency in heterogeneous Hadoop clusters
Author(s) - Xiong Runqun, Du Yao, Jin Jiahui, Luo Junzhou
Publication year - 2018
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4830
Subject(s) - computer science, erasure code, replication (statistics), data redundancy, big data, data center, distributed data store, sort, redundancy (engineering), computer data storage, distributed computing, database, data mining, operating system, algorithm, statistics, decoding methods, mathematics
Summary - Enterprises increasingly use the Hadoop Distributed File System (HDFS) to manage and store big data for many applications. However, HDFS uses triple replication, which leads to staggering data center storage costs. As big data grows in volume and its heat levels become more sensitive, there comes a point where storing so much cold data makes it less accessible and more expensive. Meanwhile, as data centers expand, the heterogeneity of nodes also becomes an issue. The rack‐aware data placement adopted by HDFS results in an unbalanced load and uneven resource allocation because it ignores the heterogeneity of the data nodes. Here, we attempt to resolve these problems by proposing a hotness‐aware data placement strategy (named HaDaap). In HaDaap, the first step is to use a hotness‐aware data clustering algorithm to determine the data's degree of heat. Then, cold data (made redundant with erasure codes) are placed through a Double Sort Exchange algorithm to reduce storage costs and increase data availability. Finally, hot data are placed via a dynamic replication placement mechanism that comprehensively factors in availability, load, and storage costs. Experimental results show that with these enhancements, HaDaap uses resources rationally and substantially reduces storage costs by accounting for differences in data hotness in heterogeneous Hadoop clusters.
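
The storage argument in the summary can be illustrated with a small back-of-the-envelope sketch. It is not the HaDaap algorithm: the fixed access-rate threshold stands in for the paper's clustering step, the Reed-Solomon (6, 3) layout is an assumed erasure-code configuration, and the `FileStat` type, file names, and sizes are hypothetical. It only shows why keeping triple replication for hot data and erasure-coding cold data lowers the raw storage footprint.

```python
from dataclasses import dataclass

REPLICATION_FACTOR = 3      # HDFS default: three full copies of every block
RS_DATA, RS_PARITY = 6, 3   # assumption: Reed-Solomon (6, 3), i.e. 1.5x overhead


@dataclass
class FileStat:
    """Hypothetical per-file statistics used only for this illustration."""
    name: str
    size_gb: float
    accesses_per_day: float


def is_hot(f: FileStat, threshold: float = 10.0) -> bool:
    # Assumption: a fixed threshold replaces the paper's hotness clustering.
    return f.accesses_per_day >= threshold


def stored_size_gb(f: FileStat, threshold: float = 10.0) -> float:
    """Raw capacity consumed: replication for hot data, erasure code for cold."""
    if is_hot(f, threshold):
        return f.size_gb * REPLICATION_FACTOR
    return f.size_gb * (RS_DATA + RS_PARITY) / RS_DATA


files = [
    FileStat("clicks.log", 100.0, 50.0),        # frequently read
    FileStat("archive-2015.tar", 400.0, 0.1),   # rarely read
]

baseline = sum(f.size_gb * REPLICATION_FACTOR for f in files)
tiered = sum(stored_size_gb(f) for f in files)
print(f"triple replication everywhere: {baseline} GB")
print(f"hot replicated, cold erasure-coded: {tiered} GB")
```

With these made-up inputs, 500 GB of logical data occupies 1500 GB under uniform triple replication and 900 GB under the tiered scheme. The sketch ignores node heterogeneity, load, and availability, which the summary says HaDaap also takes into account.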
