Extracting the Main Content of Web Pages Using the First Impression Area
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Jung, Geunseong | - |
dc.contributor.author | Han, Sungjae | - |
dc.contributor.author | Kim, Hansung | - |
dc.contributor.author | Kim, Kwanguk | - |
dc.contributor.author | Cha, Jaehyuk | - |
dc.date.accessioned | 2023-01-25T10:07:10Z | - |
dc.date.available | 2023-01-25T10:07:10Z | - |
dc.date.created | 2023-01-05 | - |
dc.date.issued | 2022-12 | - |
dc.identifier.issn | 2169-3536 | - |
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/182218 | - |
dc.description.abstract | Extracting the main content from a web page is essential in various applications such as web crawlers and browser reader modes. Existing extraction methods using text-based algorithms and features for English text can be ineffective for non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), a part of a web page that users initially view. In this area, websites have applied many techniques that enable users to find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual-based conditions. We evaluated our method, browsers’ (Mozilla Firefox and Google Chrome) reader modes, and existing main content extraction methods on multilingual datasets using two measures: Longest Common Subsequences and matched text blocks. The results showed that our method performed better than other methods in both English (up to 46%, matched text blocks F0.5) and non-English (up to 42%, matched text blocks F0.5) web pages. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | - |
dc.title | Extracting the Main Content of Web Pages Using the First Impression Area | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Kim, Hansung | - |
dc.contributor.affiliatedAuthor | Kim, Kwanguk | - |
dc.contributor.affiliatedAuthor | Cha, Jaehyuk | - |
dc.identifier.doi | 10.1109/ACCESS.2022.3229080 | - |
dc.identifier.scopusid | 2-s2.0-85144797702 | - |
dc.identifier.wosid | 000902045500001 | - |
dc.identifier.bibliographicCitation | IEEE Access, v.10, pp.129958 - 129969 | - |
dc.relation.isPartOf | IEEE Access | - |
dc.citation.title | IEEE Access | - |
dc.citation.volume | 10 | - |
dc.citation.startPage | 129958 | - |
dc.citation.endPage | 129969 | - |
dc.type.rims | ART | - |
dc.type.docType | Article in Press | - |
dc.description.journalClass | 1 | - |
dc.description.isOpenAccess | Y | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalResearchArea | Engineering | - |
dc.relation.journalResearchArea | Telecommunications | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Information Systems | - |
dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | - |
dc.relation.journalWebOfScienceCategory | Telecommunications | - |
dc.subject.keywordPlus | EYE-MOVEMENT | - |
dc.subject.keywordPlus | SEGMENTATION | - |
dc.subject.keywordPlus | PERCEPTIONS | - |
dc.subject.keywordPlus | ATTENTION | - |
dc.subject.keywordAuthor | Boilerplate removal | - |
dc.subject.keywordAuthor | main content extraction | - |
dc.subject.keywordAuthor | web content extraction | - |
dc.subject.keywordAuthor | web mining | - |
dc.subject.keywordAuthor | web segmentation | - |
dc.subject.keywordAuthor | block detection | - |
dc.identifier.url | https://ieeexplore.ieee.org/document/9984637 | - |
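
The abstract names two evaluation measures, Longest Common Subsequences and matched text blocks scored with F0.5. As a rough illustration of the first measure only, the sketch below scores an extracted text against a reference text by LCS length and combines precision and recall into F-beta with beta = 0.5, which weights precision more heavily than recall. The whitespace tokenization, the function names (`lcs_length`, `f_beta`, `lcs_scores`), and the way precision and recall are derived from the LCS are assumptions made for this example; the record does not specify the paper's exact evaluation protocol.

```python
from typing import Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # Row-by-row dynamic programming; only the previous row is kept.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def lcs_scores(extracted: str, reference: str,
               beta: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F-beta) from an LCS over tokens.

    Assumption: tokens are whitespace-separated words.
    """
    ext = extracted.split()
    ref = reference.split()
    common = lcs_length(ext, ref)
    precision = common / len(ext) if ext else 0.0
    recall = common / len(ref) if ref else 0.0
    return precision, recall, f_beta(precision, recall, beta)


if __name__ == "__main__":
    # Hypothetical extractor output versus a hand-labeled reference.
    print(lcs_scores("menu main text here", "main text here today"))
```

With beta = 0.5, an extractor that returns only correct text but misses half of the reference still scores fairly well, whereas one that returns the full page (high recall, low precision) is penalized more strongly.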