
FAULT TOLERATING MECHANISM IN DISTRIBUTED COMPUTING ENVIRONMENT
Author(s) -
Lokendra Gour,
Akhilesh A. Waoo
Publication year - 2020
Publication title -
International Journal of Engineering Applied Science and Technology
Language(s) - English
Resource type - Journals
ISSN - 2455-2143
DOI - 10.33564/ijeast.2020.v05i04.096
Subject(s) - mechanism (biology) , distributed computing , computer science , distributed computing environment , physics , quantum mechanics
Large-scale distributed systems encompass heterogeneous computational machines, workloads, and sub-systems dispersed across the cloud environment. These sub-systems frequently encounter faults and failures caused by differing data structures, hardware/software malfunctions, and communication delays. To keep computation fast under such conditions, a fault-tolerating infrastructure is implemented by adopting a machine learning approach. Under this approach, an artificial neural network (ANN) captures, manipulates, and updates the states and behaviors of the sub-systems on the server and worker machines. Multiple layers of neurons (i.e., deep learning) can handle large-scale distributed systems with large datasets. By applying variants of the stochastic gradient descent algorithm on the sub-systems (also known as computational nodes), the efficiency and reliability of a distributed system are enhanced significantly. In high-performance computing (HPC) applications, fault tolerance mechanisms must be embedded to recover from system failures.
Keywords: Distributed System, Cloud Environment, Fault Tolerance, Machine Learning, Artificial Neural Network
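To make the idea of gradient-based training on computational nodes that may fail more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a synchronous server/worker arrangement, replaces the ANN with a toy linear model y = w*x + b, and simulates worker failure with a random draw. The names `shard_gradient`, `train`, and `fail_prob` are hypothetical and introduced only for illustration.

```python
import random


def shard_gradient(w, b, shard):
    """Mean-squared-error gradient of y ~ w*x + b over one worker's data shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def train(shards, rounds=500, lr=0.05, fail_prob=0.3, seed=0):
    """Server loop: average gradients from whichever workers respond this round.

    fail_prob is an assumed, simulated per-round failure probability per worker.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    skipped = 0  # rounds in which every worker failed
    for _ in range(rounds):
        grads = []
        for shard in shards:
            if rng.random() < fail_prob:
                continue  # simulated worker failure: no gradient arrives
            grads.append(shard_gradient(w, b, shard))
        if not grads:
            skipped += 1  # keep the last known parameters and retry next round
            continue
        gw = sum(g[0] for g in grads) / len(grads)
        gb = sum(g[1] for g in grads) / len(grads)
        w -= lr * gw
        b -= lr * gb
    return w, b, skipped


# Toy, noise-free data generated from y = 2x + 1, split across four workers.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]
shards = [data[i::4] for i in range(4)]

w, b, skipped = train(shards)
print(f"w={w:.4f} b={b:.4f} skipped_rounds={skipped}")
```

In this sketch the server tolerates failures simply by averaging over the workers that responded and by keeping its parameters unchanged when none did. A real system would also need failure detection, checkpointing, and recovery of the failed nodes, none of which are modeled here.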