Extracting the Main Content of Web Pages Using the First Impression Area (open access)

Authors
Jung, Geunseong; Han, Sungjae; Kim, Hansung; Kim, Kwanguk; Cha, Jaehyuk
Issue Date
Dec-2022
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Boilerplate removal; main content extraction; web content extraction; web mining; web segmentation; block detection
Citation
IEEE Access, v.10, pp.129958 - 129969
Indexed
SCIE
SCOPUS
Journal Title
IEEE Access
Volume
10
Start Page
129958
End Page
129969
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/182218
DOI
10.1109/ACCESS.2022.3229080
ISSN
2169-3536
Abstract
Extracting the main content from a web page is essential in various applications such as web crawlers and browser reader modes. Existing extraction methods using text-based algorithms and features for English text can be ineffective for non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), a part of a web page that users initially view. In this area, websites have applied many techniques that enable users to find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual-based conditions. We evaluated our method, browsers’ (Mozilla Firefox and Google Chrome) reader modes, and existing main content extraction methods on multilingual datasets using two measures: Longest Common Subsequences and matched text blocks. The results showed that our method performed better than other methods in both English (up to 46%, matched text blocks F0.5) and non-English (up to 42%, matched text blocks F0.5) web pages.
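The abstract names two evaluation measures, one based on longest common subsequences and one reported as F0.5. The following is a minimal sketch of how an LCS-based score might be computed. It is not the authors' evaluation code. The whitespace tokenization, the definition of precision and recall as LCS length divided by extracted and reference length, and the function names lcs_length, f_beta, and lcs_scores are all assumptions made for illustration. The F-beta formula itself is the standard one, and beta = 0.5 weights precision more heavily than recall.

```python
from typing import Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; returns 0.0 when both inputs are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def lcs_scores(extracted: str, reference: str,
               beta: float = 0.5) -> Tuple[float, float, float]:
    """Assumed scoring: precision and recall from the token-level LCS."""
    ext = extracted.split()   # assumption: whitespace tokenization
    ref = reference.split()
    common = lcs_length(ext, ref)
    precision = common / len(ext) if ext else 0.0
    recall = common / len(ref) if ref else 0.0
    return precision, recall, f_beta(precision, recall, beta)


if __name__ == "__main__":
    # Hypothetical extractor output vs. hand-labelled main content.
    print(lcs_scores("menu main content here", "main content here today"))
```

Whitespace tokenization is a poor fit for languages written without spaces between words, so a multilingual evaluation would likely need character-level or language-aware tokens; the sketch leaves that choice open.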
Appears in Collections
College of Engineering (Seoul) > School of Computer Software (Seoul) > 1. Journal Articles
College of Social Sciences (Seoul) > Department of Sociology (Seoul) > 1. Journal Articles
