Predicting Machine Learning Pipeline Runtimes in the Context of Automated Machine Learning
ABSTRACT :
Automated Machine Learning (AutoML) seeks to automatically find so-called machine learning pipelines that maximize prediction performance when used to train a model on a given dataset. One of the main and still open challenges in AutoML is the effective use of computational resources: an AutoML process evaluates many candidate pipelines, and these evaluations are costly but often wasted because they are canceled due to a timeout. In this paper, we present an approach to predict the runtime of two-step machine learning pipelines with up to one pre-processor, which can be used to anticipate whether or not a pipeline will time out. Separate runtime models are trained offline for each algorithm that may be used in a pipeline, and an overall prediction is derived from these models. We empirically show that the approach increases the number of successful evaluations made by an AutoML tool while preserving or even improving on the previously best solutions.
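The idea of combining per-algorithm runtime models into one pipeline-level prediction can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `RuntimeModel` class, the toy cost formulas, the algorithm names `pca` and `random_forest`, and in particular the choice to add the two predicted runtimes. The abstract only states that an overall prediction is derived from the per-algorithm models, not how.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Shape = Tuple[int, int]  # (number of rows, number of columns)


def _same_shape(shape: Shape) -> Shape:
    """Default for algorithms that do not change the dataset shape."""
    return shape


@dataclass
class RuntimeModel:
    """Hypothetical stand-in for a runtime model trained offline for one algorithm."""
    algorithm: str
    predict_seconds: Callable[[Shape], float]
    # Pre-processors may change the shape seen by the next step.
    output_shape: Callable[[Shape], Shape] = _same_shape


def predict_pipeline_runtime(
    shape: Shape,
    learner: RuntimeModel,
    preprocessor: Optional[RuntimeModel] = None,
) -> float:
    """Assumed combination rule: pre-processor time plus learner time."""
    total = 0.0
    if preprocessor is not None:
        total += preprocessor.predict_seconds(shape)
        shape = preprocessor.output_shape(shape)  # learner sees transformed data
    total += learner.predict_seconds(shape)
    return total


def will_time_out(
    shape: Shape,
    learner: RuntimeModel,
    timeout_seconds: float,
    preprocessor: Optional[RuntimeModel] = None,
) -> bool:
    """Flag a candidate pipeline whose predicted runtime exceeds the timeout."""
    return predict_pipeline_runtime(shape, learner, preprocessor) > timeout_seconds


# Toy models with made-up coefficients; real models would be fitted offline.
pca = RuntimeModel("pca", lambda s: 2e-6 * s[0] * s[1], lambda s: (s[0], 10))
forest = RuntimeModel("random_forest", lambda s: 5e-5 * s[0] * s[1])

if __name__ == "__main__":
    dataset_shape = (100_000, 50)
    for pre in (None, pca):
        label = "with pca" if pre else "no pre-processor"
        seconds = predict_pipeline_runtime(dataset_shape, forest, pre)
        skip = will_time_out(dataset_shape, forest, 120.0, pre)
        print(f"{label}: predicted {seconds:.1f}s, skip={skip}")
```

An AutoML tool could call a check like `will_time_out` before launching an evaluation and skip candidates that are predicted to exceed the budget, which is the use case the abstract describes.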
EXISTING SYSTEM :
• We introduce a mathematical formulation covering the complete procedure of automatic ML pipeline synthesis and compare it with existing problem formulations.
• We review open-source frameworks for building ML pipelines automatically.
• An evaluation of eight hyperparameter optimization (HPO) algorithms on 137 real data sets is conducted. To the best of our knowledge, this is the first independent benchmark of HPO algorithms.
• An empirical evaluation of six AutoML frameworks on 73 real data sets is performed. To the best of our knowledge, this is the most extensive evaluation of AutoML frameworks, in terms of both tested frameworks and used data sets.
DISADVANTAGE :
• Task type and subtype categories with controlled vocabularies, used to categorize problems into a wide range of tasks beyond basic classification and regression.
• The performance metrics that the AutoML system should optimize for.
• The columns in a dataset that are target columns.
• A list of privileged data columns, that is, attributes that are unavailable during testing. These columns have no data in the test split of a dataset.
• Additional metadata such as a human-friendly name and description.
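The items listed above describe the fields of a problem description. A minimal sketch of such a structure follows; the class names, enum members, and the example problem are hypothetical and are not taken from any particular framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskType(Enum):
    """Illustrative controlled vocabulary for task types."""
    CLASSIFICATION = "classification"
    REGRESSION = "regression"
    CLUSTERING = "clustering"
    FORECASTING = "forecasting"


class TaskSubtype(Enum):
    """Illustrative controlled vocabulary for task subtypes."""
    BINARY = "binary"
    MULTICLASS = "multiclass"
    UNIVARIATE = "univariate"


@dataclass
class ProblemDescription:
    name: str                      # human-friendly name
    description: str               # human-friendly description
    task_type: TaskType
    task_subtype: TaskSubtype
    metrics: List[str]             # metrics the AutoML system should optimize
    target_columns: List[str]
    # Columns with no data in the test split.
    privileged_columns: List[str] = field(default_factory=list)

    def test_time_features(self, all_columns: List[str]) -> List[str]:
        """Columns usable as model inputs at test time, in original order."""
        excluded = set(self.target_columns) | set(self.privileged_columns)
        return [c for c in all_columns if c not in excluded]


# Hypothetical example problem.
example_problem = ProblemDescription(
    name="Loan default",
    description="Predict whether a loan will default.",
    task_type=TaskType.CLASSIFICATION,
    task_subtype=TaskSubtype.BINARY,
    metrics=["f1"],
    target_columns=["defaulted"],
    privileged_columns=["months_until_default"],
)

if __name__ == "__main__":
    columns = ["age", "income", "months_until_default", "defaulted"]
    print(example_problem.test_time_features(columns))
```

Using enums for the task type and subtype keeps the vocabulary closed, so a misspelled task name fails when the description is built rather than later during the search.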
PROPOSED SYSTEM :
• Many different algorithms have been proposed to solve specific problem instances efficiently, for example convex optimization.
• To use these methods, the features and shape of the underlying objective function (in this case the loss L) have to be known in order to select applicable solvers.
• In general, it is not possible to predict any properties of the loss function, or even to formulate it as a closed-form expression, because it depends on the generative model.
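When no properties of the loss are known, a solver can only query it as a black box. The sketch below uses plain random search to illustrate this; random search is chosen here only as the simplest black-box method and is not named in the text above. The function `toy_loss` and the hyperparameter names are made up, and in practice each loss evaluation would mean training and validating a pipeline.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def random_search(
    loss: Callable[[Config], float],
    bounds: Dict[str, Tuple[float, float]],
    n_trials: int = 50,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Minimize `loss` using only function evaluations, with no assumptions
    about convexity, smoothness, or a closed-form expression."""
    rng = random.Random(seed)  # local generator keeps the search reproducible
    best_config: Config = {}
    best_loss = float("inf")
    for _ in range(n_trials):
        config = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        value = loss(config)
        if value < best_loss:
            best_config, best_loss = config, value
    return best_config, best_loss


def toy_loss(config: Config) -> float:
    """Made-up stand-in for an expensive train-and-validate step."""
    return (config["learning_rate"] - 0.1) ** 2 + (config["regularization"] - 1.0) ** 2


SEARCH_BOUNDS = {"learning_rate": (0.0, 1.0), "regularization": (0.0, 5.0)}

if __name__ == "__main__":
    best, value = random_search(toy_loss, SEARCH_BOUNDS, n_trials=200, seed=42)
    print(best, value)
```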
ADVANTAGE :
• The framework should allow most, if not all, ML and data processing programs to be described as pipelines, while remaining as simple as possible to facilitate both automatic generation and automatic consumption of pipelines.
• Pipelines should allow the description of complete end-to-end ML programs, starting with raw files and finishing with predictions or any other ML output from models embedded in pipelines.
• The focus of the framework is machine generation and consumption, as opposed to human generation and consumption. It should enable automation as much as possible.
• The framework should be extensible and its components should be decoupled from each other, in contrast to most programming languages, where the typing system and execution semantics are tightly coupled with the language itself.
• Control of side effects and randomness in pipelines, and full reproducibility in general, should be part of the framework and not an afterthought.
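The goals above can be made concrete with a small sketch: a pipeline stored as plain data that machines can generate and consume, an executor kept separate from that description, and randomness controlled through an explicit seed. All names here (`Step`, `Pipeline`, `REGISTRY`, `run`, and the two toy primitives) are hypothetical and do not reproduce the API of any existing framework.

```python
import json
import random
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    primitive: str  # name looked up in a registry at execution time
    hyperparams: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Pipeline:
    """Pure description of a pipeline; holds no executable code."""
    steps: List[Step]

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "Pipeline":
        raw = json.loads(text)
        return Pipeline([Step(**step) for step in raw["steps"]])


# Toy primitives; each receives the random generator instead of using global state.
def scale(data: List[float], rng: random.Random, factor: float = 1.0) -> List[float]:
    return [x * factor for x in data]


def add_noise(data: List[float], rng: random.Random, sigma: float = 1.0) -> List[float]:
    return [x + rng.gauss(0.0, sigma) for x in data]


REGISTRY: Dict[str, Callable[..., List[float]]] = {
    "scale": scale,
    "add_noise": add_noise,
}


def run(pipeline: Pipeline, data: List[float], seed: int) -> List[float]:
    """Execute a described pipeline; the same seed gives the same output."""
    rng = random.Random(seed)
    for step in pipeline.steps:
        data = REGISTRY[step.primitive](data, rng, **step.hyperparams)
    return data


example_pipeline = Pipeline([
    Step("scale", {"factor": 2.0}),
    Step("add_noise", {"sigma": 0.1}),
])

if __name__ == "__main__":
    restored = Pipeline.from_json(example_pipeline.to_json())
    print(run(restored, [1.0, 2.0, 3.0], seed=7))
```

Because the description is plain JSON, a search procedure can emit it and a different runtime can execute it, and new primitives are added by extending the registry rather than changing the pipeline format.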