Windows Malware Family Dataset Construction and Classification

김태영; 최두섭; 임을규

doi:10.3745/TKIPS.2025.14.9.651


Windows Malware Family Dataset Construction and Classification

Other Titles: Windows Malware Family Dataset Construction and Classification

Authors: 김태영; 최두섭; 임을규

Issue Date: Sep-2025

Publisher: Korea Information Processing Society

Keywords: Windows Malware; Family Classification; Dataset Construction; Static and Dynamic Analysis; Hybrid Approach

Citation: The Transactions of the Korea Information Processing Society, v.14, no.9, pp. 651-661

Pages: 11

Indexed: KCI

Journal Title: The Transactions of the Korea Information Processing Society

Volume: 14

Number: 9

Start Page: 651

End Page: 661

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209039

DOI: 10.3745/TKIPS.2025.14.9.651

ISSN: 3022-701
3022-7011

Abstract: Malware family classification is essential for efficient threat analysis and for rapid response planning. Accurate classification remains difficult, however, because different families often behave similarly and the boundaries between variants are ambiguous. Moreover, most previous studies rely on older datasets and therefore fail to reflect recent malware trends in a timely manner. To address these issues, this study constructs a new dataset of 3,357 Windows malware samples collected in 2024, with label reliability ensured through cross-verification. Applying hybrid features that combine static and dynamic features to a Random Forest model on this dataset yielded a maximum classification accuracy of 92.14%. An analysis of the misclassified samples showed that some malware families share identical API call sequences, which causes confusion between them, and that in other cases the malware terminated early, so that insufficient dynamic information was collected. Based on this analysis, we suggest that more sophisticated behavior-based feature extraction and improvements to the dynamic analysis environment that prevent early termination are needed. This study is expected to make a practical contribution to improving the accuracy and reliability of future malware detection systems.
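The hybrid feature idea in the abstract, and the two failure modes it reports, can be illustrated with a small sketch. This is not the authors' pipeline: the abstract does not specify the features used, so the static fields (`file_size_kb`, `num_sections`), the bigram representation of API calls, and the helper names `api_ngrams` and `hybrid_vector` are all assumptions made for illustration, and the traces are toy data.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def api_ngrams(calls: Sequence[str], n: int = 2) -> Counter:
    """Count contiguous n-grams in an API call trace (assumed dynamic feature)."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def hybrid_vector(static: Dict[str, float], calls: Sequence[str],
                  static_keys: List[str], vocab: List[Tuple[str, ...]],
                  n: int = 2) -> List[float]:
    """Concatenate static features with n-gram counts over a fixed vocabulary."""
    grams = api_ngrams(calls, n)
    static_part = [float(static.get(k, 0.0)) for k in static_keys]
    dynamic_part = [float(grams[g]) for g in vocab]
    return static_part + dynamic_part


# Hypothetical static feature names, chosen only for this sketch.
static_keys = ["file_size_kb", "num_sections"]

# Toy traces: two samples from different families that share one API sequence,
# and one sample that stopped running after a single call.
trace_a = ["CreateFileW", "WriteFile", "CloseHandle", "RegSetValueExW"]
trace_b = list(trace_a)
trace_short = ["CreateFileW"]

vocab = sorted(api_ngrams(trace_a))

vec_a = hybrid_vector({"file_size_kb": 120, "num_sections": 4}, trace_a, static_keys, vocab)
vec_b = hybrid_vector({"file_size_kb": 310, "num_sections": 6}, trace_b, static_keys, vocab)
vec_short = hybrid_vector({"file_size_kb": 95, "num_sections": 3}, trace_short, static_keys, vocab)

# Shared API sequences give identical dynamic features; only the static part differs.
print(vec_a[2:] == vec_b[2:])
# Early termination leaves the dynamic part empty.
print(vec_short[2:])
```

In this toy setting, the samples with a shared API sequence can only be separated by their static features, and the early-terminated sample contributes no dynamic signal at all, which mirrors the two misclassification causes described in the abstract. Vectors of this shape could then be passed to a classifier such as scikit-learn's `RandomForestClassifier`.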


Appears in Collections: College of Engineering (Seoul) > School of Computer Software (Seoul) > 1. Journal Articles


Related Researcher

Im, Eul Gyu: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
