Efficient Phishing Website Detection via HTML Tag Sequence Analysis Using Encoder Models
- Authors
- Ahn, Jemin; Xiong, Zuobin; Cho, Homook; Kang, Kyungtae; Son, Junggab
- Issue Date
- Aug-2025
- Publisher
- Institute of Electrical and Electronics Engineers Inc.
- Keywords
- Classification (of Information); Computer Crime; HTML; Internet of Things; Learning Algorithms; Learning Systems; Machine Learning; Network Security; Phishing; Websites; Defence Systems; Detection Methods; HTML Tags; Machine-learning; Network Users; Phishing Websites; Security Measure; Security Mechanism; Sequence Analysis; Signal Encoding
- Citation
- Proceedings - International Conference on Computer Communications and Networks, ICCCN
- Indexed
- SCOPUS
- URI
- https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/126570
- DOI
- 10.1109/ICCCN65249.2025.11133972
- ISSN
- 1095-2055
- Abstract
- The rapid proliferation of Internet of Things (IoT) devices has led to a significant increase in the number of network users, prompting advancements in security mechanisms. Consequently, traditional attacks targeting specific vulnerabilities have become less effective due to these enhanced defense systems, leading attackers to increasingly adopt phishing strategies as a primary means of bypassing security measures. Among these, phishing websites have been increasing rapidly, exploiting the carelessness of countless users. In response, numerous phishing website detection methods have been investigated, with machine learning-based approaches emerging as a leading strategy. However, these machine learning-based classification methods require substantial computational resources, posing challenges for their direct application in the already widespread IoT environment. To address these challenges, we propose an efficient phishing website detection method based on HTML tag sequences, the core structural elements of websites, by leveraging encoder models known for their effectiveness in classifying sequential data. Our approach also incorporates a customized tokenizer and dictionary specifically tailored for HTML tags. Experiments conducted on publicly available datasets demonstrate that the proposed method achieves over 95% accuracy across key performance metrics. Furthermore, comparative analyses highlight several advantages of our method, including reduced model size and faster detection times compared to existing approaches.
- Appears in Collections
- COLLEGE OF COMPUTING > DEPARTMENT OF ARTIFICIAL INTELLIGENCE > 1. Journal Articles
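
The abstract describes reducing a page to its HTML tag sequence and mapping the tags to IDs with a custom tokenizer and dictionary before an encoder model classifies the sequence. The following is a minimal sketch of that preprocessing idea only, not the authors' implementation: the tag list, the special tokens, the `max_len` value, and all function names are hypothetical, since this record does not give the paper's actual dictionary or model.

```python
from html.parser import HTMLParser

# Hypothetical dictionary contents; the paper's actual vocabulary is not given here.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]"]
KNOWN_TAGS = ["html", "head", "title", "body", "form", "input", "a", "script", "div"]


class TagSequenceParser(HTMLParser):
    """Collects opening and closing tag names in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def build_vocab(tag_names):
    """Map special tokens, then each tag's open and close form, to integer IDs."""
    tokens = list(SPECIAL_TOKENS)
    for name in tag_names:
        tokens.extend([f"<{name}>", f"</{name}>"])
    return {token: index for index, token in enumerate(tokens)}


def extract_tag_sequence(html):
    """Return the page's tags as a flat list such as ['<html>', '<body>', ...]."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def encode(tags, vocab, max_len=16):
    """Prepend [CLS], map unknown tags to [UNK], then truncate or pad to max_len."""
    ids = [vocab["[CLS]"]] + [vocab.get(tag, vocab["[UNK]"]) for tag in tags]
    ids = ids[:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


if __name__ == "__main__":
    page = '<html><body><form><input type="password"></form></body></html>'
    vocab = build_vocab(KNOWN_TAGS)
    print(encode(extract_tag_sequence(page), vocab))
```

The fixed-length ID list is the kind of input a sequence encoder would consume. Because only tag names are kept, the input vocabulary stays small, which fits the abstract's emphasis on reduced model size, although the actual design choices are documented only in the paper itself.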
