Effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints
- Authors
- Bae, Su-Yong; Lee, Jonga; Jeong, Jaeseong; Lim, Changwon; Choi, Jinhee
- Issue Date
- Nov-2021
- Publisher
- Elsevier B.V.
- Keywords
- Class imbalance; Data balancing; Genotoxicity; Machine learning; Toxicity prediction
- Citation
- Computational Toxicology, v.20
- Journal Title
- Computational Toxicology
- Volume
- 20
- URI
- https://scholarworks.bwise.kr/cau/handle/2019.sw.cau/48904
- DOI
- 10.1016/j.comtox.2021.100178
- ISSN
- 2468-1113
- Abstract
- Machine learning and deep learning approaches have been increasingly used in the field of toxicology through prediction models developed using various toxicity data. However, toxicity data are often class-imbalanced, which hinders the development of machine learning models with good performance. Therefore, in this study, we identified effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints. Data-balancing methods, such as random undersampling (RUS), sample weight (SW), synthetic minority oversampling technique (SMOTE), and random oversampling (ROS), were applied to the datasets. Model performance was evaluated using the F1 score on five machine learning algorithms: gradient boosting tree (GBT), random forest (RF), support vector machine (SVM), multi-layer perceptron (MLP) network, and k-nearest neighbors (kNN) in combination with five molecular fingerprints (Morgan, MACCS, RDKit, Pattern, and Layered). The performance was evaluated for each combination of molecular fingerprints, machine learning algorithms, and data-balancing methods. The MACCS-GBT-SMOTE combination model achieved the best F1 score, followed by RDKit-GBT-SW. Thus, this study demonstrated that data balancing conducted using oversampling methods improved the performance of models. The systematic approach used in this study can also be applied to other toxicity datasets, which may facilitate the development of an improved classification model for toxicity screening.
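The four data-balancing methods named in the abstract (RUS, ROS, SMOTE, and SW) and the F1 score can be illustrated with a minimal standard-library sketch. This is not the authors' code: all function names are hypothetical, the toy bit vectors only stand in for molecular fingerprints, the SMOTE variant is simplified (interpolation toward one of the k nearest minority neighbors, which yields non-binary values on bit vectors), and the weighting formula `n / (n_classes * class_count)` is an assumed common heuristic, since the abstract does not state the scheme used for SW.

```python
import math
import random
from collections import Counter, defaultdict


def _group_by_class(X, y):
    """Collect feature rows per class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def random_oversample(X, y, seed=0):
    """ROS: duplicate random rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


def smote(X, y, minority, k=2, seed=0):
    """Simplified SMOTE: interpolate between a minority row and a near minority neighbor.

    Assumes the minority class has at least two rows.
    """
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    rows = groups[minority]
    target = max(len(r) for r in groups.values())
    synthetic = []
    for _ in range(target - len(rows)):
        base = rng.choice(rows)
        # Up to k nearest minority neighbors of base, by Euclidean distance.
        neighbors = sorted((r for r in rows if r is not base),
                           key=lambda r: math.dist(base, r))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return list(X) + synthetic, list(y) + [minority] * len(synthetic)


def balanced_sample_weights(y):
    """SW (assumed heuristic): weight each sample by n / (n_classes * class_count)."""
    counts = Counter(y)
    n, n_classes = len(y), len(counts)
    return [n / (n_classes * counts[label]) for label in y]


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# Toy imbalanced data: 8 negatives, 2 positives (hypothetical 4-bit "fingerprints").
X = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0],
     [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0],
     [0, 0, 1, 1], [1, 0, 1, 1]]
y = [0] * 8 + [1] * 2

print("ROS:", Counter(random_oversample(X, y)[1]))
print("RUS:", Counter(random_undersample(X, y)[1]))
print("SMOTE:", Counter(smote(X, y, minority=1)[1]))
print("SW:", sorted(set(balanced_sample_weights(y))))
```

In practice, resampling would be applied to the training split only, so that the F1 score is computed on data with the original class distribution.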