Identification of Secondary Breast Cancer in Vital Organs through the Integration of Machine Learning and Microarrays (open access)
- Authors
- Riaz, Faisal; Abid, Fazeel; Din, Ikram Ud; Kim, Byung-Seo; Almogren, Ahmad; Ul Durar, Shajara
- Issue Date
- 2-Jun-2022
- Publisher
- MDPI
- Keywords
- metastasis; microarray; gene expression omnibus; decision trees; random forest; K-nearest neighbours; support vector machine; K-means SMOTE
- Citation
- ELECTRONICS, v.11, no.12
- Journal Title
- ELECTRONICS
- Volume
- 11
- Number
- 12
- URI
- https://scholarworks.bwise.kr/hongik/handle/2020.sw.hongik/30085
- DOI
- 10.3390/electronics11121879
- ISSN
- 2079-9292
- Abstract
- Breast cancer is the most prevalent malignancy in women, and both genetic and environmental factors contribute to its pathogenesis and progression. Breast cancer metastasizes to the bones, liver, brain, and lungs, and metastasis is the main cause of death in patients. Furthermore, feature selection and classification are central to microarray data analysis, which is highly time-consuming. To address these issues, this research integrates machine learning and microarrays to identify secondary breast cancer in vital organs. The work first imputes missing values using K-nearest neighbors and improves recursive feature elimination with cross-validation (RFECV) using the random forest method. Second, class imbalance is handled with the K-means synthetic minority oversampling technique (K-means SMOTE), which balances the minority class and prevents noise. We identified the 16 most essential Entrez gene IDs responsible for predicting metastatic locations in the bones, brain, liver, and lungs. Extensive experiments were conducted on the NCBI Gene Expression Omnibus datasets GSE14020 and GSE54323. The proposed methods handle class imbalance, prevent noise, and reduce time consumption. Reliable results were obtained with four classification models: decision tree, K-nearest neighbors, random forest, and support vector machine. Results are reported using confusion matrices, accuracy, ROC-AUC, PR-AUC, and F1-score.
- Appears in Collections
- Graduate School > Software and Communications Engineering > 1. Journal Articles
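
Two of the preprocessing steps named in the abstract, K-nearest-neighbor imputation and SMOTE-style oversampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `knn_impute` and `smote_oversample`, the toy values, and the parameter defaults are all assumptions made for illustration, and the oversampler shown is plain SMOTE interpolation rather than the K-means variant used in the paper. RFECV and the four classifiers are not covered here.

```python
import math
import random


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is the root-mean-square difference over features that both
    rows have observed (an assumption of this sketch).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that actually observed feature j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Toy expression matrix (hypothetical values); None marks a missing measurement.
expression = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(expression, k=2)

# Toy minority class (hypothetical values) to be oversampled.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
synthetic = smote_oversample(minority, n_new=4)
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the region the minority class already occupies. The K-means variant additionally restricts this interpolation to clusters dominated by the minority class, which is how it limits noisy samples.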
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.