Improving Failure Tolerance in Large-Scale Cloud Computing Systems

Abstract

Fault tolerance is a major concern in guaranteeing the availability and reliability of critical services and of application execution. To minimize the impact of failures on the system and on application execution, failures should be anticipated and handled proactively. Proactive fault tolerance techniques predict failures and take appropriate action before they actually occur. This paper discusses existing fault tolerance techniques in cloud computing in terms of their policies, the tools used and the open research challenges. A cloud virtualized system architecture is proposed, in which autonomic fault tolerance has been implemented. The experimental results demonstrate that the proposed system can deal with various software faults for server applications in a cloud virtualized environment.

Existing System

Cloud computing has many characteristics, of which high availability and reliability are among the most essential. The traditional way to achieve a reliable and highly available cloud service is to use a fault tolerance module. In its simplest definition, a fault-tolerant system is one that can continue processing despite the presence of a hardware failure. Failures that occur in cloud computing can be classified into two classes: first, data failures, such as data corruption, missing source data and other flaws in the data; second, hardware failures, such as faulty or slow VMs and storage access exceptions.

Disadvantages

• Errors are more likely because processing is done on remote computers. Failures occurring in the data centers are outside the scope of the user's organization, which necessitates an autonomous fault tolerance technique for applications running in a cloud environment.
• It is difficult to interpret the changing system state because cloud environments are dynamically scalable and unpredictable, and resources, often virtualized, are provided as a service.
• Users are given only limited information because of the high system complexity, so it is difficult to design an optimal fault tolerance solution.
• A fault prediction and monitoring framework still needs to be developed for real-time applications that execute in a cloud environment.

Proposed System

To achieve the objective “study and analyze various fault tolerance and fault detection techniques in cloud computing”, a comprehensive literature survey was carried out on cloud computing and on the fault detection and fault tolerance techniques implemented in it. An extensive literature review was also carried out on the artificial neural network (ANN) models that can be used for fault detection. Our proposed failure detector is based on the heartbeat strategy and uses an artificial neural network to estimate the expected arrival time of heartbeat messages from a virtual machine. To implement the proposed algorithm in CloudSim, the work is organized as follows:
i. The monitoring process q uses an estimated value (TO), which tells q how long it has to wait for the next heartbeat message from a process p.
ii. If q has not received the heartbeat message from p after TO, it starts suspecting p.
iii. TO is allowed to change over time so that it adapts to the actual communication load.
iv. The time interval TO comprises two values: the estimated time for the arrival of the next heartbeat message (ET) and the safety margin (a). The safety margin computed by the ANN helps the detector avoid false detections.
TO = ET + a
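The steps above can be illustrated with a minimal Python sketch. It is not the proposed CloudSim implementation, and it does not contain a neural network: as a loudly labeled assumption, ET is taken as the mean of recent heartbeat gaps and the safety margin a is taken as a multiple of their standard deviation, purely as stand-ins for the ANN estimate. The class name HeartbeatDetector and the parameters window, margin_factor and min_margin are hypothetical and do not appear in the original work.

```python
from collections import deque
from statistics import mean, pstdev


class HeartbeatDetector:
    """Monitoring process q watching a process p (names follow the text)."""

    def __init__(self, window=10, margin_factor=2.0, min_margin=0.05):
        # Arrival times of the most recent heartbeats received from p.
        self.arrivals = deque(maxlen=window)
        # Hypothetical tuning knobs; the source computes the margin with an ANN.
        self.margin_factor = margin_factor
        self.min_margin = min_margin

    def heartbeat(self, now):
        """Record that a heartbeat from p arrived at time `now` (seconds)."""
        self.arrivals.append(now)

    def timeout(self):
        """Return TO = ET + a, or None until two heartbeats have arrived."""
        times = list(self.arrivals)
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if not gaps:
            return None
        # ET: estimated time until the next heartbeat (stand-in: mean gap).
        et = mean(gaps)
        # a: safety margin (stand-in for the ANN output: scaled spread of gaps).
        a = max(self.min_margin, self.margin_factor * pstdev(gaps))
        return et + a

    def suspects(self, now):
        """Step ii: suspect p once more than TO has passed since its last heartbeat."""
        to = self.timeout()
        if to is None:
            return False  # not enough history to judge yet
        return now - self.arrivals[-1] > to


if __name__ == "__main__":
    q = HeartbeatDetector()
    for t in (0.0, 1.0, 2.0, 3.0):
        q.heartbeat(t)
    print("TO:", q.timeout())
    print("suspect at t=3.5:", q.suspects(3.5))
    print("suspect at t=4.2:", q.suspects(4.2))
```

Because TO is recomputed from a sliding window on every call, it follows changes in communication load as described in step iii. Replacing the two stand-in lines for ET and a with a trained model's predictions would bring the sketch closer to the detector described here.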

Advantages

• A proactive fault tolerance mechanism designed for dynamic clouds, using an artificial neural network for fault detection, can be more beneficial than traditional models.
• The algorithm provides a detection time that is independent of the last heartbeat message, which makes the failure detector adaptive and increases its accuracy.
