Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models

Authors
Sadad, Tariq; Aurangzeb, Raja Atif; Safran, Mejdl; Imran; Alfarhood, Sultan; Kim, Jungsuk
Issue Date
Apr-2023
Publisher
MDPI
Keywords
BERT; deep learning; DNA/RNA sequence; K-MERS
Citation
Biomedicines, v.11, no.5
Journal Title
Biomedicines
Volume
11
Number
5
URI
https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/88356
DOI
10.3390/biomedicines11051323
ISSN
2227-9059
Abstract
Viruses infect millions of people worldwide each year, and some can cause cancer or increase the risk of it. Because viral genomes are highly mutable, new viruses may emerge in the future, as happened with COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but a new virus may diverge completely or partially from the reference genome, so statistical methods and similarity calculations are not sufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in distinguishing types of lethal pathogens, including their variants and strains. Although various bioinformatics tools can align these sequences, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, and machine learning plays a crucial role in it by extracting domain- and task-specific features. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system takes nucleotide sequences from the NCBI GenBank database and uses a BERT tokenizer to extract features by breaking the sequences down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a BERT architecture trained from scratch and designed specifically for DNA analysis, which learns to predict the next codons in an unsupervised manner, and a classifier that identifies important features and captures the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.
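The abstract names k-mers as a keyword and describes breaking sequences into tokens, but it does not state the k-mer length, the stride, or how RNA bases are handled. The following is a minimal sketch of overlapping k-mer tokenization under assumed values (k = 3, stride = 1, with U mapped to T); the function name `kmer_tokens` and its parameters are hypothetical and are not taken from the paper.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers of length k.

    Assumptions (not from the paper): k=3, stride=1, and RNA input is
    mapped onto the DNA alphabet by replacing U with T.
    """
    seq = sequence.upper().replace("U", "T")
    # Slide a window of width k across the sequence; a sequence shorter
    # than k yields no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    # The space-joined k-mers form a "sentence" that a word-level
    # tokenizer could then map to vocabulary ids.
    print(" ".join(kmer_tokens("ATGCGTAC", k=3)))
```

With `stride=1` the k-mers overlap, so each nucleotide appears in up to k tokens; setting `stride=k` would instead give non-overlapping, codon-like tokens.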
Related Researcher

Imran
College of IT Convergence (Department of Biomedical Engineering)