Schema Theory Based Data Engineering in Gene Expression Programming for Big Data Analytics
ABSTRACT :
Gene expression programming (GEP) is a data-driven evolutionary technique that is well suited to correlation mining. Parallel GEPs have been proposed to speed up the evolution process using a cluster of computers or a computer with multiple CPU cores. However, the generation structure of chromosomes and the size of the input data are two issues that tend to be neglected when speeding up GEP evolution. To fill this research gap, this paper proposes three guiding principles that elaborate the computational nature of GEP evolution based on an analysis of GEP schema theory. As a result, a novel data engineered GEP is developed which follows the generation structure of chromosomes closely in parallelization and takes the input data size into account in segmentation. Experimental results on two data sets with complementary features show that the data engineered GEP speeds up the evolution process significantly without loss of accuracy in data correlation mining. Based on the experimental tests, a computation model of the data engineered GEP is further developed to demonstrate its high scalability in dealing with potential big data using a large number of CPU cores.
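The following is a minimal sketch, not the authors' implementation, of the core idea described above: evaluating the fitness of a GEP population in parallel across CPU cores, with the input data segmented into chunks so each worker scores candidates on one segment. The fitness function, the polynomial stand-in for a GEP expression, and the segment count are illustrative assumptions.

```python
import numpy as np
from multiprocessing import Pool

def evaluate(args):
    """Mean-squared-error fitness of one candidate on one data segment."""
    candidate, X_seg, y_seg = args
    pred = np.polyval(candidate, X_seg)  # stand-in for a decoded GEP expression
    return np.mean((pred - y_seg) ** 2)

def parallel_fitness(population, X, y, n_segments=4):
    # Segment the input data so each task touches only one chunk.
    X_chunks = np.array_split(X, n_segments)
    y_chunks = np.array_split(y, n_segments)
    tasks = [(c, Xs, ys)
             for c in population
             for Xs, ys in zip(X_chunks, y_chunks)]
    with Pool() as pool:
        scores = pool.map(evaluate, tasks)
    # Merge the per-segment errors back into one fitness value per candidate.
    scores = np.array(scores).reshape(len(population), n_segments)
    return scores.mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, 10_000)
    y = 3 * X ** 2 + 0.5 * X
    population = [rng.normal(size=3) for _ in range(20)]  # quadratic coefficients
    print(parallel_fitness(population, X, y))
```

Averaging per-segment errors is one simple merge rule; the paper's point is that both the chromosome generation structure and the segment size affect whether this parallelism actually pays off.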
EXISTING SYSTEM :
• Big data analytics is closely related to data mining approaches.
• Mining big data is more challenging than traditional data mining due to the massive data volume.
• The common practice is to extend existing data mining algorithms to cope with massive datasets by executing them on samples of the big data and then merging the sample results, as sketched below.
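Here is a hedged sketch of that sample-and-merge practice, assuming a regression task; the model choice, sample count, and sample size are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def sample_and_merge(X, y, n_samples=5, sample_size=1_000, seed=0):
    """Train one model per random sample, then merge by averaging predictions."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        models.append(LinearRegression().fit(X[idx], y[idx]))
    # Merge step: average the predictions of the per-sample models.
    return lambda X_new: np.mean([m.predict(X_new) for m in models], axis=0)
```

Averaging is one simple merge rule; voting or stacking the per-sample models are common alternatives.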
DISADVANTAGE :
• The flexible structure of GEP, together with its black-box style of solution searching, makes GEP an appealing analytic approach to big data problems.
• The other data set has strongly correlated data samples, but each sample has a small number of input factors. Experimental results show that the data engineered GEP reduces the computation time significantly without loss of accuracy in processing the segmented data chunks, which makes it scalable in dealing with potential big data problems.
• Existing parallel GEPs have not considered the size of the input data in parallelization, leading to a scalability issue when dealing with the ever-growing size of potential big data.
PROPOSED SYSTEM :
• The method first applies principal component analysis for feature selection on a diabetes dataset and estimates the best feature weights using mutual information.
• Afterwards, the method is applied to classify patients, where a modified cuckoo search is used to find the best values for the C and γ parameters of the proposed method (see the sketch after this list).
• Many biclustering algorithms have been proposed by researchers. Cheng and Church propose a biclustering method that iteratively deletes and adds objects and features in a greedy manner.
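A minimal sketch of the pipeline described above: PCA-based feature reduction, mutual-information feature weights, and a classifier whose C and γ parameters are tuned by a search. The C/γ pairing suggests an SVM, which is an assumption here, and a plain random search stands in for the modified cuckoo search; the dataset and all parameter ranges are likewise illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fit_pipeline(X, y, n_components=5, n_trials=30, seed=0):
    # 1. Feature reduction with PCA.
    X_pca = PCA(n_components=n_components).fit_transform(X)
    # 2. Weight each component by its mutual information with the label.
    weights = mutual_info_classif(X_pca, y, random_state=seed)
    X_w = X_pca * weights
    # 3. Search for good C and gamma (stand-in for the modified cuckoo search).
    rng = np.random.default_rng(seed)
    best_params, best_score = None, -np.inf
    for _ in range(n_trials):
        C, gamma = 10 ** rng.uniform(-2, 3), 10 ** rng.uniform(-4, 1)
        score = cross_val_score(SVC(C=C, gamma=gamma), X_w, y, cv=3).mean()
        if score > best_score:
            best_params, best_score = (C, gamma), score
    return best_params, best_score
```

A metaheuristic such as cuckoo search explores the (C, γ) space more adaptively than this random search, but the evaluation loop, cross-validated accuracy per candidate pair, is the same.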
ADVANTAGE :
• The proposed self-learning GEP provides a mechanism to maintain the structure of the accumulated schemata, which leads to enhanced performance.
• When the structure of a data set, like the power system data, is simple, the performance gain of parallelization can easily be offset by the computation overhead incurred in maintaining the CPU threads.
• To evaluate the performance of the data engineered GEP, a number of experiments were conducted.