Parallel Semi-Supervised Big Data Clustering Based on Mapreduce Technology

Open Access

Author(s) - A. Ghobashy M., S. Duraisamy

Publication year - 2019

Publication title - International Journal of Recent Technology and Engineering (IJRTE)

Language(s) - English

Resource type - Journals

ISSN - 2277-3878

DOI - 10.35940/ijrte.c5206.118419

Subject(s) - cluster analysis, computer science, big data, data mining, correlation clustering, CURE data clustering algorithm, consensus clustering, data stream clustering, canopy clustering algorithm, clustering high-dimensional data, graph, artificial intelligence, theoretical computer science

Big data is one of the fastest-growing areas of information technology, and it poses tremendous challenges for extracting valuable hidden knowledge. Data mining techniques can be applied to big data to extract knowledge that supports decision making. Big data is highly heterogeneous because it consists of various inter-related kinds of objects, such as audio, text, and images, and these kinds of objects carry different information. In this paper, clustering techniques are therefore introduced to separate the objects into several clusters. Clustering also reduces the computational complexity of classifiers. A Possibilistic c-Means (PCM) algorithm was introduced to group the objects in big data. PCM effectively reflects the characteristics of each object with respect to different clusters and can resist corruption by noise in the clustering process. However, PCM is not efficient enough for big data, and it cannot capture the complex correlations across the multiple modalities of heterogeneous data objects. So, a Parallel Semi-supervised Multi-Ant Colonies Clustering (PSMACC) is introduced for big data clustering. PSMACC first splits the data into a number of partitions, and each partition is processed by a mapper. Each mapper generates a diverse collection of three clustering components using the semi-supervised ant colony clustering algorithm with various moving speeds. A hypergraph model is then used to combine the three clustering components, and two constraints, Must-Link (ML) and Cannot-Link (CL), are included to form a consensus clustering. Finally, the intermediate results of each mapper are combined in the reducer. However, the iteration overhead of PSMACC is overwhelming, which degrades its performance. So, a Parallel Semi-supervised Multi-Imperialist Competitive Algorithm (PSMICA) is proposed to cluster big data. In PSMICA, each mapper runs the Imperialist Competitive Algorithm (ICA), whose initial population consists of individuals called countries. Some of the best countries in the population are chosen as the imperialists, and the remaining countries form the colonies of these imperialists. The colonies move towards the imperialists based on the distance between them. The intermediate results of each mapper are combined in the reducer to obtain the final clustering result.
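The mapper and reducer flow described for PSMICA can be illustrated with a small single-process sketch. This is not the authors' implementation: the abstract does not say how a country encodes a clustering, how colonies are assigned, or how the reducer merges results, so everything below is an assumption made for illustration. Here a country is a list of k candidate cluster centers, its cost is the sum of squared distances of points to their nearest center, colonies are assigned round-robin, and the reducer averages the sorted centers from each mapper. The Must-Link and Cannot-Link constraints and the competition between empires are omitted, and all function names (`ica_mapper`, `reducer`, `parallel_ica_clustering`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost(country, points):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(pt, c) for c in country) for pt in points)


def move_toward(colony, imperialist, rng, beta=2.0):
    """Assimilation step: shift each colony center toward the imperialist's."""
    return [tuple(c + beta * rng.random() * (i - c) for c, i in zip(cc, ic))
            for cc, ic in zip(colony, imperialist)]


def ica_mapper(points, k, n_countries=20, n_imperialists=4,
               iterations=30, seed=0):
    """Run a simplified ICA on one partition and return its best k centers."""
    rng = random.Random(seed)
    # Each country is a candidate solution: k centers drawn from the data.
    countries = [[rng.choice(points) for _ in range(k)]
                 for _ in range(n_countries)]
    countries.sort(key=lambda c: cost(c, points))
    imperialists = countries[:n_imperialists]
    # Assumption: colonies are dealt out round-robin to the imperialists.
    empires = [[] for _ in imperialists]
    for idx, colony in enumerate(countries[n_imperialists:]):
        empires[idx % n_imperialists].append(colony)
    for _ in range(iterations):
        for e, colonies in enumerate(empires):
            for c in range(len(colonies)):
                colonies[c] = move_toward(colonies[c], imperialists[e], rng)
                # A colony that beats its imperialist takes over the empire.
                if cost(colonies[c], points) < cost(imperialists[e], points):
                    imperialists[e], colonies[c] = colonies[c], imperialists[e]
    return min(imperialists, key=lambda c: cost(c, points))


def reducer(mapper_outputs):
    """Assumption: merge by sorting each center set and averaging by rank."""
    ordered = [sorted(centers) for centers in mapper_outputs]
    n, k, dim = len(ordered), len(ordered[0]), len(ordered[0][0])
    return [tuple(sum(o[j][d] for o in ordered) / n for d in range(dim))
            for j in range(k)]


def parallel_ica_clustering(points, k, n_partitions):
    """Split the data, run one mapper per partition, combine in the reducer."""
    partitions = [points[i::n_partitions] for i in range(n_partitions)]
    return reducer([ica_mapper(part, k, seed=i)
                    for i, part in enumerate(partitions)])


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
    data += [(rng.gauss(50, 1), rng.gauss(50, 1)) for _ in range(30)]
    rng.shuffle(data)
    print(parallel_ica_clustering(data, k=2, n_partitions=3))
```

In a real MapReduce deployment the list comprehension in `parallel_ica_clustering` would be replaced by mapper tasks running on separate nodes, with only the per-partition centers shipped to the reducer.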
