New Visual Features for HTML Main Content Extraction (HTML 본문 추출을 위한 새로운 시각적 Feature)

정근성; 차재혁

doi:10.9728/dcs.2023.24.4.691

Full metadata record

DC Field	Value	Language
dc.contributor.author	정근성	-
dc.contributor.author	차재혁	-
dc.date.accessioned	2023-06-01T07:09:32Z	-
dc.date.available	2023-06-01T07:09:32Z	-
dc.date.issued	2023-04	-
dc.identifier.issn	1598-2009	-
dc.identifier.issn	2287-738X	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/185904	-
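dc.description.abstract	HTML main content extraction is a technique for identifying the main content area of a web page and its contents. The features that existing techniques use to distinguish main content are mostly structural features built from the tags of HTML nodes, or text features made up of statistics of the text a node contains. However, these features depend on trends in web page templates, language, region, and similar factors. Consequently, algorithms or models that rely on these features can show performance variation caused by the language or environment of the web page. This paper therefore proposes new visual features that minimize the degradation of HTML main content extraction performance on multilingual web pages. These features derive from the attributes of HTML nodes as rendered in the browser and are comparatively little affected by language or region. Using the Google TabNet deep neural network architecture, we trained one neural network model on only the existing structural and text features and another on the existing features plus the newly proposed visual features, then compared their main content extraction performance, demonstrating the performance improvement provided by the proposed visual features.	-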
dc.description.abstract	Hypertext markup language (HTML) main content extraction is a technology that identifies the body and contents of an article from web pages. Traditional technologies use structural features, such as the tag structure of the HTML node, and text features based on statistical properties of the node text. However, because these features depend on web development trends, language, and the region of the webpage, the performance of algorithms or models based on these features can vary. Therefore, in this study, we propose a novel visual feature to prevent the degradation of HTML body extraction performance on multilingual web pages. The feature is based on the attributes of HTML nodes as rendered in the browser; therefore, the influence of the language or region is relatively small. The Google TabNet deep neural network architecture was used to train one neural network model on only structural and text features and another model on the structural and text features together with the newly introduced visual feature. A comparison of the body extraction performance of the two models demonstrates the performance improvement provided by the visual feature proposed in this study.	-
dc.format.extent	9	-
dc.language	Korean	-
dc.language.iso	KOR	-
dc.publisher	한국디지털콘텐츠학회 (Digital Contents Society)	-
dc.title	HTML 본문 추출을 위한 새로운 시각적 Feature	-
dc.title.alternative	New Visual Features for HTML Main Content Extraction	-
dc.type	Article	-
dc.publisher.location	Republic of Korea	-
dc.identifier.doi	10.9728/dcs.2023.24.4.691	-
dc.identifier.bibliographicCitation	디지털콘텐츠학회논문지 (Journal of Digital Contents Society), v.24, no.4, pp. 691-699	-
dc.citation.title	디지털콘텐츠학회논문지 (Journal of Digital Contents Society)	-
dc.citation.volume	24	-
dc.citation.number	4	-
dc.citation.startPage	691	-
dc.citation.endPage	699	-
dc.identifier.kciid	ART002952456	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	kci	-
dc.subject.keywordAuthor	Main content extraction	-
dc.subject.keywordAuthor	Webpage	-
dc.subject.keywordAuthor	Web content extraction	-
dc.subject.keywordAuthor	Deep neural net model	-
dc.subject.keywordAuthor	Google TabNet	-
dc.subject.keywordAuthor	주요 콘텐츠 추출 (Main content extraction)	-
dc.subject.keywordAuthor	웹페이지 (Webpage)	-
dc.subject.keywordAuthor	웹 콘텐츠 추출 (Web content extraction)	-
dc.subject.keywordAuthor	신경망 모델 (Neural network model)	-
dc.identifier.url	http://journal.dcs.or.kr/_common/do.php?a=full&b=12&bidx=3284&aidx=36513	-
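
The abstract contrasts text features, which vary with language, against visual features taken from the rendered page. The following Python sketch illustrates that contrast only; it is not the authors' feature set. The `RenderedNode` fields, the specific features, and every function name are assumptions made for illustration, and the rendered box values are assumed to have been read from a browser beforehand.

```python
from dataclasses import dataclass


@dataclass
class RenderedNode:
    """One HTML node plus hypothetical attributes read back from a browser."""
    tag: str          # structural: tag name
    depth: int        # structural: depth in the DOM tree
    text: str         # text contained in the node
    x: float          # visual: left edge of the rendered box, in pixels
    y: float          # visual: top edge of the rendered box, in pixels
    width: float      # visual: rendered width, in pixels
    height: float     # visual: rendered height, in pixels
    font_size: float  # visual: computed font size, in pixels


def structural_features(node):
    # Illustrative only: DOM depth and a flag for typical content tags.
    return [float(node.depth), 1.0 if node.tag in ("p", "article") else 0.0]


def text_features(node):
    # Character and whitespace-delimited word counts. The word count is
    # language dependent: scripts written without spaces yield one "word".
    return [float(len(node.text)), float(len(node.text.split()))]


def visual_features(node, page_w, page_h):
    # Position and size normalised by the page size, so the values do not
    # depend on what language the text is written in.
    return [
        (node.x + node.width / 2) / page_w,          # horizontal centre
        node.y / page_h,                             # vertical offset
        (node.width * node.height) / (page_w * page_h),  # share of page area
        node.font_size,
    ]


def feature_row(node, page_w, page_h, use_visual):
    """Build one table row; use_visual toggles the added visual columns."""
    row = structural_features(node) + text_features(node)
    if use_visual:
        row += visual_features(node, page_w, page_h)
    return row


if __name__ == "__main__":
    # Two nodes rendered in the same box, one English and one Chinese.
    en = RenderedNode("p", 5, "Visual features travel well", 200, 300, 800, 500, 16)
    zh = RenderedNode("p", 5, "视觉特征不依赖语言", 200, 300, 800, 500, 16)
    for node in (en, zh):
        print(feature_row(node, 1200, 1000, use_visual=True))
```

Rows built with `use_visual=False` and `use_visual=True` would correspond to the two input tables for the two models the abstract compares; training a tabular model such as TabNet on them is outside this sketch.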

Files in This Item

HTML 본문 추출을 위한 새로운 시각적 Feature.pdf 1.68 MB

Appears in Collections: Seoul College of Engineering > Seoul School of Computer Science > 1. Journal Articles

Related Researcher

Cha, Jae Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)