

Open Access

A Performance Comparison of Big Data Processing Platform Based on Parallel Clustering Algorithms

Author(s) - Mo Hai, Yuejing Zhang, Haifeng Li

Publication year - 2018

Publication title - Procedia Computer Science

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.334

H-Index - 76

ISSN - 1877-0509

DOI - 10.1016/j.procs.2018.10.228

Subject(s) - computer science, cluster analysis, spark (programming language), big data, node (physics), set (abstract data type), parallel computing, data set, parallel processing, data mining, fuzzy clustering, algorithm, artificial intelligence, structural engineering, engineering, programming language

The performance of three typical big data processing platforms (Hadoop, Spark, and DataMPI) is compared using three parallel clustering algorithms: parallel K-means, parallel fuzzy K-means, and parallel Canopy. Experiments are performed on both text and numeric datasets and on clusters of different scales. The results show that (1) for the same data set, when each node has 4 GB of memory, DataMPI achieves about a 60% performance improvement over Hadoop and about a 32% performance improvement over Spark; and (2) to obtain high clustering performance, a cluster with 6 nodes and 6 GB of memory per node should be selected.
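For readers unfamiliar with the parallel K-means workload named in the abstract, the following is a minimal single-process sketch of the map/reduce pattern such an algorithm typically follows. It is not the authors' implementation and uses none of the Hadoop, Spark, or DataMPI APIs. The function names (`assign`, `kmeans_iteration`, `kmeans`) and the representation of data as a list of partitions are assumptions made purely for illustration.

```python
from collections import defaultdict
from math import dist


def assign(point, centroids):
    """Map step: return the index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans_iteration(partitions, centroids):
    """One iteration: assign points per partition, then merge partial sums."""
    dim = len(centroids[0])
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    # On a real platform each partition would be processed by its own worker;
    # here the partitions are simply visited one after another.
    for part in partitions:
        for p in part:
            k = assign(p, centroids)
            counts[k] += 1
            sums[k] = [s + x for s, x in zip(sums[k], p)]
    # Reduce step: each new centroid is the mean of its assigned points.
    # A centroid with no assigned points is left unchanged.
    return [
        tuple(s / counts[k] for s in sums[k]) if counts[k] else centroids[k]
        for k in range(len(centroids))
    ]


def kmeans(partitions, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(partitions, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Calling `kmeans(partitions, initial_centroids)` with each partition given as a list of numeric tuples returns the final centroids. The per-partition loop is the part a platform would distribute across nodes, which is why node count and per-node memory plausibly affect the running times the abstract reports.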
