Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud Computing Systems

Abstract

Cloud computing systems fail in complex and unexpected ways, due to unforeseen combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are not yet known. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning to execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes at a low computational cost.
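As an illustration of the general idea of grouping fault injection experiments by how their traces deviate from a fault-free run, here is a minimal sketch. It is not the method used in the paper: the event names, the count-based trace representation, the cosine distance, the greedy clustering and the 0.3 threshold are all assumptions chosen to keep the example short.

```python
from collections import Counter
from math import sqrt

def trace_vector(trace, vocabulary):
    """Count how often each event type occurs in a trace."""
    counts = Counter(trace)
    return [counts[event] for event in vocabulary]

def deviation(vector, baseline):
    """Difference between an experiment's counts and the fault-free counts."""
    return [v - b for v, b in zip(vector, baseline)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 and norm_b == 0:
        return 0.0  # two traces with no deviation are identical
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a deviating trace is far from a non-deviating one
    return 1.0 - dot / (norm_a * norm_b)

def cluster(vectors, threshold=0.3):
    """Greedy clustering: join the first cluster whose leader is close enough."""
    leaders, labels = [], []
    for vec in vectors:
        for idx, leader in enumerate(leaders):
            if cosine_distance(vec, leader) <= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels

# Hypothetical event traces; the event names are invented for this example.
fault_free = ["api_call", "db_write", "msg_send", "api_reply"]
experiments = [
    ["api_call", "db_write", "msg_send", "api_reply"],    # no visible effect
    ["api_call", "db_write", "msg_send"],                 # reply never sent
    ["api_call", "db_write", "msg_send"],                 # reply never sent
    ["api_call", "db_write", "db_write", "db_write", "msg_send", "api_reply"],
    ["api_call", "db_write", "db_write", "msg_send", "api_reply"],
]

vocabulary = sorted(set(fault_free).union(*experiments))
baseline = trace_vector(fault_free, vocabulary)
deviations = [deviation(trace_vector(t, vocabulary), baseline) for t in experiments]
labels = cluster(deviations)
print(labels)
```

Each distinct label is a candidate failure mode for the analyst to inspect. In this toy data the missing-reply runs and the repeated-write runs land in separate groups, and the run with no deviation forms a group of its own.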

Existing System

Fault resolution in communication networks and distributed systems is a complicated process that requires system administrators and supporting systems to monitor, diagnose, resolve and record faults. This process becomes more challenging in an inter-cloud environment, where multiple cloud systems coordinate in provisioning applications and services.

Disadvantages

- They use the same data format, which contains pre-defined fields used to keep track of the status of the problem, while textual descriptions are used to describe the problem.
- They also establish a reasoning engine that allows searching for similar past problems, and providing reports and statistics for performance evaluation of the services.

Proposed System

We propose a fault resolution system that assists system administrators in resolving faults in an inter-cloud environment. The proposed system is characterized by its ability to share and search fault knowledge resources among cloud systems for fault resolution. It uses a peer-to-peer network of fault managers that provide facilities to monitor faults occurring in cloud systems and to search for similar faults, with their solutions, that have occurred in other cloud systems. We have implemented several components of the proposed system, including the fault monitor, fault searcher and fault updater. We have also tested and evaluated the prototype system on fault databases obtained from several fault sources, such as bug tracking systems, online discussion forums and vendor knowledge bases.
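The search across peer fault managers can be pictured with a small sketch. This is only an assumption-laden illustration, not the implemented system: the `FaultRecord` and `FaultManager` classes, the word-overlap (Jaccard) similarity, the 0.2 cutoff and the sample records are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FaultRecord:
    description: str
    solution: str

def jaccard(text_a, text_b):
    """Word-overlap similarity between two fault descriptions (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

class FaultManager:
    """Hypothetical fault manager holding one cloud system's fault knowledge."""

    def __init__(self, name):
        self.name = name
        self.records = []
        self.peers = []

    def update(self, record):
        """Fault updater: store a newly observed fault and its solution."""
        self.records.append(record)

    def search_local(self, query, min_score):
        hits = []
        for record in self.records:
            score = jaccard(query, record.description)
            if score >= min_score:
                hits.append((score, self.name, record))
        return hits

    def search(self, query, min_score=0.2):
        """Fault searcher: ask this manager and its direct peers, best match first."""
        hits = self.search_local(query, min_score)
        for peer in self.peers:
            hits.extend(peer.search_local(query, min_score))
        return sorted(hits, key=lambda hit: hit[0], reverse=True)

# Invented sample data for two cloud systems.
cloud_a = FaultManager("cloud-a")
cloud_b = FaultManager("cloud-b")
cloud_a.peers.append(cloud_b)

cloud_a.update(FaultRecord("network port binding failed", "check the agent configuration"))
cloud_b.update(FaultRecord("virtual machine instance stuck in build state", "restart the scheduler service"))
cloud_b.update(FaultRecord("disk quota exceeded on volume create", "raise the volume quota"))

hits = cloud_a.search("instance stuck in build state")
for score, source, record in hits:
    print(f"{score:.2f} {source}: {record.solution}")
```

Only matching records travel back to the querying manager, which is one simple way to keep data transfer low. A real deployment would also need peer discovery, multi-hop query routing and a more robust similarity measure.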

Advantages

- The approach also proposes a novel way to characterize system behavior by decomposing performance along the dimensions of time, space, and volume.
- The approach can be useful for system administrators to debug performance problems in MapReduce systems. These studies deal with the same challenge: fault monitoring and diagnosis, or failure prediction.
- We measure the capability of the system to contribute a large amount of fault data while minimizing data transfer, and to retrieve a large number of query hits with a short response time.
