AutoDiagn: An Automated Real-time Diagnosis Framework for Big Data Systems

Abstract

Big data processing systems, such as Hadoop and Spark, usually work in large-scale, highly concurrent, multi-tenant environments that can easily cause hardware and software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect performance degradation in big data processing systems, perform root-cause analysis, and even overcome the issues causing such degradation. However, these solutions focus on specific problems such as stragglers and inefficient resource utilization. There is a lack of a generic and extensible framework to support the real-time diagnosis of big data systems. In this paper, we propose, develop, and validate AutoDiagn, a generic and flexible framework that provides holistic monitoring of a big data system while detecting performance degradation and enabling root-cause analysis. We present an implementation and evaluation of AutoDiagn that interacts with a Hadoop cluster deployed on a public cloud and tested with real-world benchmark applications. Experimental results show that AutoDiagn offers highly accurate root-cause analysis while maintaining a small resource footprint, high throughput, and low latency.

Existing System

• The dmon-controller can also be used to deploy and configure node-level metrics onto registered nodes. Because of this, credentials for each node (username, password, or key) must be supplied when registering nodes.
• If node-level components and services have already been deployed by other tools, only the already deployed node-level service endpoints have to be registered. In this scenario, credentials are not needed.
• Metrics version annotation is also supported by the dmon-controller. This means that metrics pertaining to a specific application version can be annotated using tags, which makes it easy to query, aggregate, or compare the metrics of the application.
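The version annotation described above can be sketched in a few lines. This is a minimal illustration, not the dmon-controller API: the `MetricRecord` type, the `version` tag key, the `mean_by_tag` helper, and the sample values are all hypothetical and exist only to show how tagged metrics can be aggregated and compared per application version.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class MetricRecord:
    """One metric sample with free-form tags (hypothetical schema)."""
    name: str
    value: float
    tags: dict = field(default_factory=dict)


def mean_by_tag(records, metric_name, tag_key):
    """Average one metric per distinct value of a tag, e.g. per app version."""
    groups = defaultdict(list)
    for record in records:
        # Skip other metrics and samples that were never annotated.
        if record.name == metric_name and tag_key in record.tags:
            groups[record.tags[tag_key]].append(record.value)
    return {tag: mean(values) for tag, values in groups.items()}


# Illustrative values only, not measurements.
records = [
    MetricRecord("cpu_percent", 40.0, {"version": "v1"}),
    MetricRecord("cpu_percent", 60.0, {"version": "v1"}),
    MetricRecord("cpu_percent", 30.0, {"version": "v2"}),
    MetricRecord("mem_percent", 70.0, {"version": "v2"}),
]

print(mean_by_tag(records, "cpu_percent", "version"))
# -> {'v1': 50.0, 'v2': 30.0}
```

Keeping the version as an ordinary tag rather than a dedicated field means the same grouping helper works for any other annotation, such as a node name or a job identifier.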

Disadvantages

• Developing a general and extensible framework for diagnosing a big data system is not trivial.
• It requires well-defined requirements which could enable the broader adoption of root-cause analysis for big data systems, flexible APIs to interact with an underlying monitoring system, and integration of multiple solutions for detecting performance reduction problems while enabling automatic root-cause analysis.
• In this paper, we tackle this research gap, and design and develop AutoDiagn to automatically detect performance degradation and inefficient resource utilization problems, while providing online detection and semi-online root-cause analysis for a big data system.

Proposed System

• In most of the reviewed platforms, analytics against collected monitoring data is handled via user-defined alerts. Although these provide valuable data for Ops teams, they do not provide the level of insight required by Dev teams for optimization and validation purposes.
• More sophisticated, contextualized methods and tools are required. The DICE Anomaly Detection component is able to detect such anomalies and, with the help of the DICE Enhancement tools, will feed this information back into design-time models.
• logstash-forwarder is designed to forward logs to one or more logstash server instances. By using this approach inside DMon, we eliminate node-level side effects caused by local processing of logs.

Advantages

• To the best of our knowledge, there is a lack of a generic and comprehensive solution for detecting a wide range of anomalies and performing root-cause analysis in big data systems.
• We develop a general framework called AutoDiagn which can be adapted for the detection of a wide range of performance degradation problems while pinpointing their root causes in big data systems.
• AutoDiagn visualizes the collected metrics and the results of root-cause analysis of any failures causing performance reduction in the cluster through a user-friendly interface in real time.
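To make the idea of detecting a performance problem and then pinpointing its root cause more concrete, the sketch below flags straggler tasks and assigns each a candidate cause. It is not AutoDiagn's actual method, which this text does not specify: the `TaskSample` fields, the threshold of 1.5 times the median duration, the 90% CPU limit, and the two candidate causes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskSample:
    """Per-task observation (hypothetical fields)."""
    task_id: str
    duration_s: float
    data_local: bool          # did the task read its input from the local node?
    node_cpu_percent: float   # CPU utilization of the hosting node


def find_stragglers(tasks, factor=1.5):
    """Return tasks whose duration exceeds factor times the median duration."""
    if not tasks:
        return []
    threshold = factor * median(t.duration_s for t in tasks)
    return [t for t in tasks if t.duration_s > threshold]


def guess_root_cause(task, cpu_limit=90.0):
    """Map a straggler to a candidate cause using simple assumed rules."""
    if not task.data_local:
        return "non-local data access"
    if task.node_cpu_percent >= cpu_limit:
        return "CPU contention on node"
    return "unknown"


# Illustrative values only, not measurements.
tasks = [
    TaskSample("t1", 10.0, True, 35.0),
    TaskSample("t2", 11.0, True, 40.0),
    TaskSample("t3", 12.0, True, 38.0),
    TaskSample("t4", 30.0, False, 42.0),
    TaskSample("t5", 25.0, True, 97.0),
]

for straggler in find_stragglers(tasks):
    print(straggler.task_id, "->", guess_root_cause(straggler))
```

Separating detection (`find_stragglers`) from diagnosis (`guess_root_cause`) mirrors the split between detecting performance degradation and analyzing its root cause, so either step could be swapped out without touching the other.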
