Developing an automated framework for eco-label information categorization using web crawling and Natural Language Processing techniques
- Authors
- Nguyen, Ho Anh Thu; Pham, Duy Hoang; Kim, Byeol; Ahn, Yonghan; Kwon, Nahyun
- Issue Date
- Jul-2025
- Publisher
- PERGAMON-ELSEVIER SCIENCE LTD
- Keywords
- Green building material; Eco-label; Information management; Machine learning; Natural language processing
- Citation
- EXPERT SYSTEMS WITH APPLICATIONS, v.282, pp 1 - 24
- Pages
- 24
- Indexed
- SCIE; SCOPUS
- Journal Title
- EXPERT SYSTEMS WITH APPLICATIONS
- Volume
- 282
- Start Page
- 1
- End Page
- 24
- URI
- https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/125248
- DOI
- 10.1016/j.eswa.2025.127688
- ISSN
- 0957-4174; 1873-6793
- Abstract
- Eco-labels are extensively employed to assess the environmental performance of building materials. However, their management is often fragmented across disparate online databases with inconsistent data structures, presenting significant challenges for efficient information acquisition and management. This study explores the application of web crawling techniques, Natural Language Processing (NLP), and machine learning (ML) models to collect and categorize eco-label information, with the objective of advancing the automation of information management processes. The results demonstrate that the categorization models exhibit high performance, achieving F1-scores exceeding 0.95 on the test set and at least 0.76 when validated on datasets incorporating temporally updated information. However, the limited availability of data for certain eco-labels, such as Forest Stewardship Council certification and Green Screen, substantially degrades model performance with updated data. Notably, traditional ML models leveraging manual feature engineering outperform deep learning models with automatic feature extraction when applied to web-crawled data. Furthermore, the TF-IDF feature extraction technique surpasses other n-gram-based approaches, with model performance declining as n-gram length increases. This study establishes a systematic framework that informs the selection of reliable data sources, feature engineering strategies, and ML algorithms for integrating web crawling, thereby enhancing the automation of eco-label information management.
- Appears in Collections
- COLLEGE OF ENGINEERING SCIENCES > MAJOR IN ARCHITECTURAL ENGINEERING > 1. Journal Articles
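
The abstract above mentions TF-IDF features over n-grams and F1-scores as the evaluation metric. As a rough illustration of those two ideas only, the following is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the sample texts, and the IDF variant (ln(N/df) + 1) are assumptions chosen for illustration, since the abstract does not state the exact weighting or tokenization used.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (joined by spaces) of the lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Compute per-document TF-IDF weights for word n-grams of length n.

    Assumption: raw term frequency and idf = ln(N / df) + 1, one common
    variant; the paper's exact weighting is not given in the abstract.
    """
    grams = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: each document counts a given n-gram once.
    df = Counter(g for c in grams for g in c)
    total = len(docs)
    return [
        {g: tf * (math.log(total / df[g]) + 1.0) for g, tf in c.items()}
        for c in grams
    ]


def f1_score(y_true, y_pred, positive):
    """Binary F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical snippets standing in for crawled eco-label records.
docs = [
    "fsc certified plywood panel",
    "fsc certified timber flooring",
    "low emission interior paint",
]
unigram = tfidf(docs, n=1)  # weights for single words
bigram = tfidf(docs, n=2)   # weights for two-word sequences
```

In this toy corpus, a word that appears in only one record ("plywood") receives a higher weight than one shared across records ("fsc"). Longer n-grams produce fewer and sparser features per record, which is one plausible reason, though not one confirmed by the abstract, for performance declining as n-gram length increases.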
