Applying a random projection algorithm to optimize machine learning model for breast lesion classification

Abstract

Machine learning is widely used in developing computer-aided diagnosis (CAD) schemes for medical images. However, CAD schemes usually compute a large number of image features from the targeted regions, which makes it challenging to identify a small, optimal feature vector for building robust machine learning models. In this study, we investigate the feasibility of applying a random projection algorithm to build an optimal feature vector from the large feature pool initially generated by CAD and thereby improve the performance of the machine learning model. We assemble a retrospective dataset of 1,487 mammography cases, of which 644 have confirmed malignant mass lesions and 843 have benign lesions. A CAD scheme is first applied to segment mass regions and compute an initial set of 181 features. Then, support vector machine (SVM) models embedded with several feature dimensionality reduction methods are built to predict the likelihood of lesions being malignant. All SVM models are trained and tested using a leave-one-case-out cross-validation method. The SVM generates a likelihood score for each segmented mass region depicted on a one-view mammogram. By fusing the two scores of the same mass depicted on two-view mammograms, a case-based likelihood score is also evaluated. Compared with principal component analysis, nonnegative matrix factorization, and Chi-squared methods, the SVM embedded with the random projection algorithm yielded a significantly higher case-based lesion classification performance, with an area under the ROC curve of 0.84±0.01. The study demonstrates that the random projection algorithm is a promising method for generating optimal feature vectors to help improve the performance of machine learning models of medical images.

Existing System

• The current paper presents a review of the literature and a brief description of earlier research on data mining techniques applied to breast cancer.
• Cancer is a harmful type of disease in which the cells in a part of the body start to grow out of control.
• Cancer cell development differs from normal cell development, and the cancer cells eventually form a tumor.
• Breast cancer (BC) is the most widely recognized invasive cancer in women around the world. It is the second leading disease, after lung cancer, contributing to the death rate in women.

Disadvantages

The disadvantage of the deep learning approach is that it requires very large training and validation image datasets to build robust models, and such datasets are often unavailable in medical imaging. Another approach is to use the radiomics concept and method, which generates a large initial feature pool and then applies a feature selection method to select a small set of features. In this study, we investigate the feasibility of applying a new approach based on a random projection algorithm, aiming to generate optimal feature vectors for training the machine learning models implemented in CAD schemes of mammograms to classify breast lesions. This approach creates an orthogonal feature space that can avoid or minimize feature correlation.
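To make the idea of random projection concrete, the following is a minimal sketch, not the study's implementation. It assumes a Gaussian projection matrix (the text does not say which variant was used), and the function name `gaussian_random_projection` and the target dimension are hypothetical. Note that a Gaussian random matrix gives directions that are only approximately orthogonal in high dimensions.

```python
import math
import random


def gaussian_random_projection(features, n_components, seed=0):
    """Project rows of `features` (n_samples x n_features) onto n_components
    random directions. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    n_features = len(features[0])
    # Entries are drawn i.i.d. from N(0, 1/k), so squared distances are
    # preserved in expectation (Johnson-Lindenstrauss style scaling).
    scale = 1.0 / math.sqrt(n_components)
    projection = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
        for _ in range(n_features)
    ]
    # Matrix product: each output row is row @ projection.
    return [
        [
            sum(row[i] * projection[i][j] for i in range(n_features))
            for j in range(n_components)
        ]
        for row in features
    ]
```

For example, a pool of 181 features per region could be reduced with `gaussian_random_projection(pool, n_components=20)`; the value 20 is an arbitrary placeholder, not a figure from the study.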

Proposed System

• This paper focuses on the learning perspectives of different systems. It provides an overview of the most common techniques encountered in disease diagnosis and also classifies each paper in terms of its solution for predicting the disease.
• The authors also classify each paper in terms of how the diagnosis or prediction was determined using different machine learning techniques.
• This paper presents a survey of the research that has been carried out on predicting and diagnosing breast cancer using different machine learning techniques.

Advantages

• We use a leave-one-case-out (LOCO) cross-validation method to train the SVM model and evaluate its performance. The feature dimensionality reduction method discussed in the second step is also embedded in this LOCO iteration process to train the SVM. This can diminish the potential bias in feature dimensionality reduction and machine learning model training, as we demonstrated in our previous study.
• From the confusion matrix, we compute the classification accuracy, sensitivity, specificity, and odds ratio (OR) of each SVM model on both a lesion-region and a case basis. In the region-based performance evaluation, all lesion regions are considered independent, while in the case-based performance evaluation, the average classification score of the two matched lesion regions (if the lesion is detected and marked by radiologists in both the CC and MLO views) is computed and used.
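The evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the 0.5 decision threshold is a placeholder, all regions of a case are assumed to be held out together, and the odds ratio is assumed to be the diagnostic odds ratio TP·TN / (FP·FN), which the text does not spell out.

```python
from collections import defaultdict


def leave_one_case_out(case_ids):
    """Yield (train_idx, test_idx), holding out all regions of one case."""
    for held_out in dict.fromkeys(case_ids):  # unique cases, first-seen order
        test_idx = [i for i, c in enumerate(case_ids) if c == held_out]
        train_idx = [i for i, c in enumerate(case_ids) if c != held_out]
        # Any dimensionality reduction would be fitted on train_idx only here.
        yield train_idx, test_idx


def case_scores(case_ids, region_scores):
    """Average the region scores (e.g. CC and MLO views) of each case."""
    grouped = defaultdict(list)
    for case_id, score in zip(case_ids, region_scores):
        grouped[case_id].append(score)
    return {c: sum(v) / len(v) for c, v in grouped.items()}


def confusion_metrics(scores, labels, threshold=0.5):
    """Compute metrics from a confusion matrix; label 1 = malignant, 0 = benign."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        # Assumed definition: diagnostic odds ratio; infinite if FP or FN is 0.
        "odds_ratio": (tp * tn) / (fp * fn) if fp and fn else float("inf"),
    }
```

Holding out every region of a case at once keeps the two views of the same lesion from appearing in both the training and test sets of a single iteration.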
