Premium
An efficient iterative graph data processing framework based on bulk synchronous parallel model
Author(s) -
Liu Chao,
Zeng Deze,
Yao Hong,
Yan Xuesong,
Yu Linchen,
Fu Zhangjie
Publication year - 2018
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4432
Subject(s) - computer science , graph , big data , graph database , theoretical computer science , hash function , pagerank , distributed computing , load balancing (electrical power) , parallel computing , graph partition , power graph analysis , synchronization (alternating current) , data mining , computer network , channel (broadcasting) , geometry , computer security , mathematics , grid
Summary Graph data processing has been widely applied in a variety of domains such as industry, science, social network, and so on. It therefore has stimulated many efforts devoted to this area. To embrace the fast development trend of big graph data, graph data processing based on Pregel‐like systems has been regarded as one of the most promising ways and has widely attracted the attention of researchers. However, it still remains in its early stage and there still exist many challenges. In Pregel, the superstep synchronization is time consuming as the graph data iteration operation requires multiple synchronizations. Furthermore, the graph data partition strategy adopted by Pregel fails to support load balancing, therefore causing the increase of network I/O overhead as the scale of graph data grows. To address these issues, this paper presents an efficient computational framework for graph data processing based on the bulk synchronous parallel model. The global synchronization control mechanism is improved by determining the start time of the next round of superstep through counting the number of global message files. Furthermore, an improved graph data partition mechanism based on a balanced hash method is proposed to reduce the communication overhead between different partitions of sub‐graph computational tasks. We also re‐design the PageRank algorithm to verify the effectiveness of the proposed framework. Experimental results on different real‐world datasets verify the efficiency of our proposed framework as it outperforms Giraph (an open source Pregel‐like system) by 58 % −69 % , and achieves 10×−17× performance improvement over Hadoop.