Extracting the Main Content of Web Pages Using the First Impression Area (Open Access)
- Authors
- Jung, Geunseong; Han, Sungjae; Kim, Hansung; Kim, Kwanguk; Cha, Jaehyuk
- Issue Date
- Dec-2022
- Publisher
- IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- Keywords
- Boilerplate removal; main content extraction; web content extraction; web mining; web segmentation; block detection
- Citation
- IEEE Access, v.10, pp.129958 - 129969
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE Access
- Volume
- 10
- Start Page
- 129958
- End Page
- 129969
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/182218
- DOI
- 10.1109/ACCESS.2022.3229080
- ISSN
- 2169-3536
- Abstract
- Extracting the main content from a web page is essential in various applications such as web crawlers and browser reader modes. Existing extraction methods using text-based algorithms and features for English text can be ineffective for non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), a part of a web page that users initially view. In this area, websites have applied many techniques that enable users to find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual-based conditions. We evaluated our method, browsers’ (Mozilla Firefox and Google Chrome) reader modes, and existing main content extraction methods on multilingual datasets using two measures: Longest Common Subsequences and matched text blocks. The results showed that our method performed better than other methods in both English (up to 46%, matched text blocks F0.5) and non-English (up to 42%, matched text blocks F0.5) web pages.
- Appears in Collections
- Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles
- Seoul College of Social Sciences > Seoul Department of Sociology > 1. Journal Articles
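
The abstract reports results using a Longest Common Subsequence measure and F0.5 scores. As a rough illustration of how such a score can be computed, the sketch below compares extracted text with a reference text. It is not the authors' evaluation code: the whitespace tokenization, the function names (`lcs_length`, `f_beta`, `lcs_scores`), and the choice to derive precision and recall directly from the LCS length are all assumptions made for this example. Only the F-beta formula itself is standard.

```python
from typing import Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def lcs_scores(extracted: str, gold: str, beta: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F-beta) from the token-level LCS.

    Assumption: tokens are whitespace-separated words. Real multilingual
    data (e.g. text without spaces) would need a different tokenizer.
    """
    ext_tokens = extracted.split()
    gold_tokens = gold.split()
    common = lcs_length(ext_tokens, gold_tokens)
    precision = common / len(ext_tokens) if ext_tokens else 0.0
    recall = common / len(gold_tokens) if gold_tokens else 0.0
    return precision, recall, f_beta(precision, recall, beta)


if __name__ == "__main__":
    # Two leftover navigation words lower precision; one missing word lowers recall.
    p, r, f = lcs_scores("Home News the quick brown fox", "the quick brown fox jumps")
    print(f"precision={p:.3f} recall={r:.3f} F0.5={f:.3f}")
```

With beta set to 0.5, leftover boilerplate in the extracted text (lower precision) costs more than missing part of the main content (lower recall).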