MLDEG: A Machine Learning Approach to Identify Differentially Expressed Genes Using Network Property and Network Propagation

Abstract

Motivation: Identifying differentially expressed genes (DEGs) in transcriptome data is an important task. However, the performance of existing DEG methods varies significantly across data sets measured under different conditions, and no single statistical or machine learning model for DEG detection performs consistently well on data sets of different traits. In addition, setting a cutoff value for the significance of differential expression is one of the confounding factors in determining DEGs.

Results: We address these problems by developing an ensemble model that refines the heterogeneous and inconsistent results of existing methods by taking into account network information such as network propagation and network property. DEG candidates that the existing tools predict with weak evidence are re-classified by the proposed ensemble model. Tested on 10 RNA-seq data sets downloaded from the Gene Expression Omnibus (GEO), our method ranked first in detecting ground truth (GT) genes in eight data sets and found almost all GT genes in six data sets. In contrast, the performance of all existing methods varied significantly across the 10 data sets. Because of its design principle, our method can naturally accommodate any new DEG method.
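The abstract does not say which propagation scheme is used. As a rough illustration only, the sketch below assumes a random walk with restart, a common way to spread evidence from confident DEGs to their network neighbours. The function name `propagate`, the `restart` parameter, and the four-gene toy network are all hypothetical and are not taken from MLDEG.

```python
def propagate(adjacency, seed_scores, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart over an undirected gene network (illustrative).

    adjacency:   dict mapping gene -> set of neighbouring genes (assumed symmetric)
    seed_scores: dict mapping gene -> initial evidence, e.g. 1.0 for a confident DEG
    """
    genes = sorted(adjacency)
    total = sum(seed_scores.get(g, 0.0) for g in genes)
    if total == 0:
        raise ValueError("at least one seed gene needs a positive score")
    # Normalise the seeds so the scores form a probability distribution.
    p0 = {g: seed_scores.get(g, 0.0) / total for g in genes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for g in genes:
            # Each neighbour passes on an equal share of its current score.
            inflow = sum(p[n] / len(adjacency[n]) for n in adjacency[g])
            nxt[g] = (1 - restart) * inflow + restart * p0[g]
        delta = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: gene A is a confident DEG, D is only reachable via C.
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
scores = propagate(network, {"A": 1.0})
```

In a setup like this, a weak-evidence candidate that sits close to confident DEGs ends up with a higher propagated score than one far from them, and that score could serve as one input feature of an ensemble classifier.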

Existing System

• It can be seen clearly from this figure that there existed four distinct clusters corresponding to the four classes in the data.
• In this work, we aim to use a hybrid approach that harnesses the power of both machine learning and network biology to provide new insights and improve understanding of cancer etiology, particularly related to the existence of Class II cancer genes.
• Although gene-gene associations found in MALANI-derived cancer networks do not necessarily imply causality, our machine learning-based reverse engineered cancer networks provide key information regarding the existence of Class II cancer genes, which link to Class I cancer genes in order to complete oncogenic signaling in cancer networks.

Disadvantages

• The large amount of expression data generated by this technology makes the study of certain complex biological problems possible, and machine learning methods are playing a crucial role in the analysis process.
• We first identify the major types of the classification problems; then apply several machine learning methods to solve the problems and perform systematic tests on real and artificial datasets.
• It is especially useful in handling real world problems that have the following properties (Freund and Schapire, 1996): the samples have various degrees of hardness to learn and the learner is sensitive to the change of training samples.

Proposed System

• The neural network feature selector algorithm proposed by Setiono and Liu uses a fixed partition of the samples into training and cross-validation sets as its initial input.
• Instead of simply splitting the sample set into two partitions, the neural network feature selector proposed in this thesis employs the leave-one-out method to reduce the bias in estimating the generalization performance.
• The hybridization of LIK+RFE on the SRBCT dataset also significantly outperformed a neural network method proposed by other researchers.
• In particular, we improve a neural network feature selector method, develop a multivariate likelihood feature selection method, and propose a hybrid framework of univariate and multivariate feature selection methods.
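The leave-one-out idea mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's feature selector: a 1-nearest-neighbour rule stands in for the neural network, and the function names and the toy expression values are assumptions made for the example.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training sample (squared Euclidean)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def leave_one_out_accuracy(samples, labels, predict=nearest_neighbour_predict):
    """Hold out each sample once, fit on the rest, and average the hits."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += predict(train_x, train_y, samples[i]) == labels[i]
    return hits / len(samples)


# Made-up expression profiles (two genes per sample), three samples per class.
samples = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
loo_accuracy = leave_one_out_accuracy(samples, labels)
```

Every sample is used for validation exactly once, so the estimate does not depend on one arbitrary train/validation split. The cost is one model fit per sample.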

Advantages

• We apply LR, KNC, SVM, GNB, DTC, and RFC to the PD and BRCA data and evaluate their performance in terms of accuracy, sensitivity, specificity, and precision.
• The use of log2FC estimates and knowledge of prior gene regulation with a DNN enables the capture of non-linear patterns from biologically validated gene samples and improves the prediction performance of our model in determining UR and DR genes.
• We use a confusion matrix to calculate the accuracy, sensitivity (recall), specificity, and precision for evaluating the performance of our model.
• We also compare the performance of DEGnet with six other ML-based methods for both the PD and BRCA datasets.
• We compare the performance of DEGnet with the same six other ML-based methods in terms of accuracy, sensitivity, specificity, and precision.
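The four evaluation measures listed above all follow from the counts in a binary confusion matrix. The sketch below is a minimal illustration, assuming UR is treated as the positive class and DR as the negative class. The helper names and the eight example labels are made up and are not results from any of the cited studies.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), specificity, and precision from the counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Made-up labels: UR = up-regulated (positive), DR = down-regulated (negative).
y_true = ["UR", "UR", "UR", "UR", "DR", "DR", "DR", "DR"]
y_pred = ["UR", "UR", "UR", "DR", "DR", "DR", "DR", "DR"]
counts = confusion_counts(y_true, y_pred, positive="UR")
metrics = evaluation_metrics(*counts)
```

Here one UR gene is missed and no DR gene is mislabelled, so sensitivity drops to 0.75 while specificity and precision stay at 1.0. This is why the four measures are reported together instead of accuracy alone.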
