Detailed Information

Effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints

Authors
Bae, Su-Yong; Lee, Jonga; Jeong, Jaeseong; Lim, Changwon; Choi, Jinhee
Issue Date
Nov-2021
Publisher
Elsevier B.V.
Keywords
Class imbalance; Data balancing; Genotoxicity; Machine learning; Toxicity prediction
Citation
Computational Toxicology, v.20
Journal Title
Computational Toxicology
Volume
20
URI
https://scholarworks.bwise.kr/cau/handle/2019.sw.cau/48904
DOI
10.1016/j.comtox.2021.100178
ISSN
2468-1113
Abstract
Machine learning and deep learning approaches have been increasingly used in the field of toxicology through prediction models developed using various toxicity data. However, toxicity data are often class-imbalanced, which hinders the development of machine learning models with good performance. Therefore, in this study, we identified effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints. Data-balancing methods, such as random undersampling (RUS), sample weight (SW), synthetic minority oversampling technique (SMOTE), and random oversampling (ROS) were applied to the datasets. Model performance was evaluated using the F1 score on five machine learning algorithms: gradient boosting tree (GBT), random forest (RF), support vector machine (SVM), multi-layer perceptron (MLP) network, and k-nearest neighbors (kNN) in combination with five molecular fingerprints (Morgan, MACCS, RDKit, Pattern, and Layered). The performance was evaluated for each combination of molecular fingerprints, machine learning algorithms, and data-balancing methods. The MACCS-GBT-SMOTE combination model achieved the best F1 score, followed by RDKit-GBT-SW. Thus, this study demonstrated that data balancing conducted using oversampling methods improved the performance of models. The systematic approach used in this study can also be applied to other toxicity datasets, which may facilitate the development of an improved classification model for toxicity screening. © 2021
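As an illustration of the four data-balancing methods and the F1 metric named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy data, the neighbor count `k`, and the "balanced" weight formula n_samples / (n_classes * class_count) are all assumptions made for this example, and short continuous feature rows stand in for real molecular fingerprint bit vectors. The SMOTE-style function is a simplified interpolation between same-class neighbors rather than a full implementation.

```python
import random
from collections import Counter


def split_by_class(X, y):
    """Group feature rows by their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def unzip(pairs):
    """Turn [(row, label), ...] back into (X, y)."""
    return [r for r, _ in pairs], [l for _, l in pairs]


def random_undersample(X, y, rng):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    groups = split_by_class(X, y)
    n = min(len(rows) for rows in groups.values())
    return unzip([(row, label) for label, rows in groups.items()
                  for row in rng.sample(rows, n)])


def random_oversample(X, y, rng):
    """ROS: duplicate random rows of smaller classes up to the largest class size."""
    groups = split_by_class(X, y)
    n = max(len(rows) for rows in groups.values())
    out = []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n - len(rows))]
        out += [(row, label) for row in rows + extra]
    return unzip(out)


def smote_like(X, y, rng, k=2):
    """SMOTE-style: add synthetic rows interpolated toward a near same-class neighbor."""
    groups = split_by_class(X, y)
    n = max(len(rows) for rows in groups.values())
    out = [(row, label) for label, rows in groups.items() for row in rows]
    for label, rows in groups.items():
        for _ in range(n - len(rows)):
            base = rng.choice(rows)
            # Up to k nearest same-class rows by squared Euclidean distance.
            others = sorted((r for r in rows if r is not base),
                            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))[:k]
            neighbor = rng.choice(others) if others else base
            t = rng.random()  # interpolation factor in [0, 1)
            out.append(([a + t * (b - a) for a, b in zip(base, neighbor)], label))
    return unzip(out)


def balanced_sample_weights(y):
    """SW (assumed formula): n_samples / (n_classes * class_count) per sample."""
    counts = Counter(y)
    return [len(y) / (len(counts) * counts[label]) for label in y]


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0


# Toy imbalanced data: four negatives (0) and two positives (1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 1, 1]
rng = random.Random(0)

X_rus, y_rus = random_undersample(X, y, rng)
X_ros, y_ros = random_oversample(X, y, rng)
X_smt, y_smt = smote_like(X, y, rng)
weights = balanced_sample_weights(y)
```

In this sketch the resampling functions change the training set itself, whereas the weight function leaves the data untouched and would instead be passed to a learner that accepts per-sample weights. In either case the F1 score should be computed on held-out data that has not been resampled.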
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Lim, Chang Won
Graduate School (Department of Statistics and Data Science)
