Extracting the Main Content of Web Pages Using the First Impression Area

Full metadata record
DC Field | Value | Language
dc.contributor.author | Jung, Geunseong | -
dc.contributor.author | Han, Sungjae | -
dc.contributor.author | Kim, Hansung | -
dc.contributor.author | Kim, Kwanguk | -
dc.contributor.author | Cha, Jaehyuk | -
dc.date.accessioned | 2023-01-25T10:07:10Z | -
dc.date.available | 2023-01-25T10:07:10Z | -
dc.date.created | 2023-01-05 | -
dc.date.issued | 2022-12 | -
dc.identifier.issn | 2169-3536 | -
dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/182218 | -
dc.description.abstract | Extracting the main content from a web page is essential in various applications such as web crawlers and browser reader modes. Existing extraction methods using text-based algorithms and features for English text can be ineffective for non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), a part of a web page that users initially view. In this area, websites have applied many techniques that enable users to find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual-based conditions. We evaluated our method, browsers’ (Mozilla Firefox and Google Chrome) reader modes, and existing main content extraction methods on multilingual datasets using two measures: Longest Common Subsequences and matched text blocks. The results showed that our method performed better than other methods in both English (up to 46%, matched text blocks F0.5) and non-English (up to 42%, matched text blocks F0.5) web pages. | -
dc.language | English | -
dc.language.iso | en | -
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | -
dc.title | Extracting the Main Content of Web Pages Using the First Impression Area | -
dc.type | Article | -
dc.contributor.affiliatedAuthor | Kim, Hansung | -
dc.contributor.affiliatedAuthor | Kim, Kwanguk | -
dc.contributor.affiliatedAuthor | Cha, Jaehyuk | -
dc.identifier.doi | 10.1109/ACCESS.2022.3229080 | -
dc.identifier.scopusid | 2-s2.0-85144797702 | -
dc.identifier.wosid | 000902045500001 | -
dc.identifier.bibliographicCitation | IEEE Access, v.10, pp.129958 - 129969 | -
dc.relation.isPartOf | IEEE Access | -
dc.citation.title | IEEE Access | -
dc.citation.volume | 10 | -
dc.citation.startPage | 129958 | -
dc.citation.endPage | 129969 | -
dc.type.rims | ART | -
dc.type.docType | Article in Press | -
dc.description.journalClass | 1 | -
dc.description.isOpenAccess | Y | -
dc.description.journalRegisteredClass | scie | -
dc.description.journalRegisteredClass | scopus | -
dc.relation.journalResearchArea | Computer Science | -
dc.relation.journalResearchArea | Engineering | -
dc.relation.journalResearchArea | Telecommunications | -
dc.relation.journalWebOfScienceCategory | Computer Science, Information Systems | -
dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | -
dc.relation.journalWebOfScienceCategory | Telecommunications | -
dc.subject.keywordPlus | EYE-MOVEMENT | -
dc.subject.keywordPlus | SEGMENTATION | -
dc.subject.keywordPlus | PERCEPTIONS | -
dc.subject.keywordPlus | ATTENTION | -
dc.subject.keywordAuthor | Boilerplate removal | -
dc.subject.keywordAuthor | main content extraction | -
dc.subject.keywordAuthor | web content extraction | -
dc.subject.keywordAuthor | web mining | -
dc.subject.keywordAuthor | web segmentation | -
dc.subject.keywordAuthor | block detection | -
dc.identifier.url | https://ieeexplore.ieee.org/document/9984637 | -
Appears in Collections
Seoul College of Engineering > Seoul School of Computer Software > 1. Journal Articles
Seoul College of Social Sciences > Seoul Department of Sociology > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Han sung
COLLEGE OF SOCIAL SCIENCES (DEPARTMENT OF SOCIOLOGY)