Machine learning-based detection method for malicious PDF files: A temporal classification approach

Choi, Doo-Seop; Kim, Taeguen; Kang, Boojoong; Im, Eul Gyu

doi:10.1016/j.asoc.2025.114461

Full metadata record

DC Field	Value	Language
dc.contributor.author	Choi, Doo-Seop	-
dc.contributor.author	Kim, Taeguen	-
dc.contributor.author	Kang, Boojoong	-
dc.contributor.author	Im, Eul Gyu	-
dc.date.accessioned	2026-02-10T06:01:40Z	-
dc.date.available	2026-02-10T06:01:40Z	-
dc.date.issued	2026-03	-
dc.identifier.issn	1568-4946	-
dc.identifier.issn	1872-9681	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/210734	-
dc.description.abstract	Cybercriminals increasingly exploit non-executable files that can bypass antivirus software detection and are often opened by users without suspicion. In particular, PDF files have become a primary attack vector for adversaries due to their platform-independent nature and ability to preserve document components across different systems. Malicious PDF files continuously evolve to avoid detection, and traditional detection methods, which rely primarily on static features from older PDF datasets, show limitations in identifying evolving malicious PDF files. This paper identifies temporal evolution in feature distributions and proposes a novel framework to detect malicious PDF files by introducing temporal classification and addressing the evolved characteristics of recent threats. Through in-depth statistical analysis, we revealed that recent malicious PDF files closely mimic the structural characteristics of legitimate files, exhibiting an 11-fold increase in graphic components and a 21-fold increase in hyperlinks compared to older samples. This finding indicates a significant shift in attack methodologies from traditional script injection to social engineering techniques. To address this challenge, we enhanced the basic feature set, comprising 31 structural and metadata-based features initially defined in the CIC-Evasive-PDFMal2022 dataset, by integrating 12 newly identified features, resulting in an enhanced set of 43 features. Experimental results demonstrate that our framework with the enhanced feature set achieves 97.80 % detection accuracy using the random forest algorithm, representing a 4.12 % improvement over the basic feature set. The framework maintains balanced performance across all metrics with a recall of 0.96, a precision of 0.98, an F1-score of 0.97, and an AUC of 0.99. Additionally, the framework reduced the false positive rate (FPR) from 2.84 % to 1.12 %, a 1.72 percentage-point reduction, which is critical for practical deployment in real-world security environments. The proposed enhanced feature set provides an effective approach for strengthening real-world detection systems, including email attachment scanners and antivirus engines, against evolving PDF-based attacks.	-
dc.format.extent	22	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	ELSEVIER	-
dc.title	Machine learning-based detection method for malicious PDF files: A temporal classification approach	-
dc.type	Article	-
dc.publisher.location	Netherlands	-
dc.identifier.doi	10.1016/j.asoc.2025.114461	-
dc.identifier.scopusid	2-s2.0-105027474566	-
dc.identifier.wosid	001658578000003	-
dc.identifier.bibliographicCitation	Applied Soft Computing, v.189, pp 1 - 22	-
dc.citation.title	Applied Soft Computing	-
dc.citation.volume	189	-
dc.citation.startPage	1	-
dc.citation.endPage	22	-
dc.type.docType	Article	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.relation.journalWebOfScienceCategory	Computer Science, Interdisciplinary Applications	-
dc.subject.keywordPlus	Classification (of information)	-
dc.subject.keywordPlus	Computer viruses	-
dc.subject.keywordPlus	Feature extraction	-
dc.subject.keywordPlus	Learning algorithms	-
dc.subject.keywordPlus	Network security	-
dc.subject.keywordAuthor	Analysis of temporal feature evolution	-
dc.subject.keywordAuthor	Machine learning	-
dc.subject.keywordAuthor	Malware detection	-
dc.subject.keywordAuthor	Non-executable malware	-
dc.subject.keywordAuthor	PDF malware	-
dc.identifier.url	https://www.sciencedirect.com/science/article/pii/S1568494625017740?via%3Dihub	-
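
The figures quoted in the abstract can be cross-checked with simple arithmetic. The following is a minimal sketch in Python, not code from the paper: the variable names are illustrative assumptions, the inputs are the rounded values the abstract reports, and the relative FPR reduction is a derived quantity that the abstract does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the abstract (names are illustrative).
precision, recall = 0.98, 0.96
fpr_basic, fpr_enhanced = 2.84, 1.12   # false positive rates, in percent
basic_features, new_features = 31, 12

f1 = f1_score(precision, recall)

# Absolute drop in FPR, measured in percentage points.
fpr_drop_points = fpr_basic - fpr_enhanced

# Relative drop in FPR as a fraction of the baseline (derived, not reported).
fpr_drop_relative = fpr_drop_points / fpr_basic

total_features = basic_features + new_features

print(f"F1-score:               {f1:.2f}")
print(f"FPR drop (pct. points): {fpr_drop_points:.2f}")
print(f"FPR drop (relative):    {fpr_drop_relative:.1%}")
print(f"Enhanced feature count: {total_features}")
```

Under these inputs the F1-score rounds to 0.97, the FPR drop is 1.72 percentage points, and the enhanced set has 43 features, all consistent with the abstract. The distinction between the absolute drop (percentage points) and the relative drop (a fraction of the baseline rate) matters when comparing detectors, which is why the sketch computes both.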

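Appears in Collections: Seoul College of Engineering > Seoul Department of Future Automotive Engineering > 1. Journal Articles; Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles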

Related Researcher

Im, Eul Gyu: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
