Main content extraction from web documents using text block context
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kim, M. | - |
dc.contributor.author | Kim, Y. | - |
dc.contributor.author | Song, W. | - |
dc.contributor.author | Khil, A. | - |
dc.date.available | 2019-04-10T10:23:32Z | - |
dc.date.created | 2018-04-17 | - |
dc.date.issued | 2013 | - |
dc.identifier.isbn | 9783642401725 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | http://scholarworks.bwise.kr/ssu/handle/2018.sw.ssu/32881 | - |
dc.description.abstract | Owing to the variety of Web authoring tools, new Web standards, and improved Web accessibility, a wide variety of Web content is being produced very quickly. In such an environment, providing Web services appropriate to users' needs requires quickly and accurately extracting relevant information from Web documents and removing irrelevant content such as advertisements. In this paper, we propose a method that accurately extracts the main content from HTML Web documents. In the method, a decision tree is built and used to classify whether each block of text is part of the main content. For classification we use contextual features around text blocks, including word density, link density, HTML tag distribution, and distances between text blocks. We evaluated our method on a published data set and on a data set that we collected. The experimental results show that our method achieves an F-measure 19% higher than that of the best-performing existing method. © 2013 Springer-Verlag. | - |
dc.relation.isPartOf | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.title | Main content extraction from web documents using text block context | - |
dc.type | Conference | - |
dc.identifier.doi | 10.1007/978-3-642-40173-2_10 | - |
dc.type.rims | CONF | - |
dc.identifier.bibliographicCitation | 24th International Conference on Database and Expert Systems Applications, DEXA 2013, v.8056 LNCS, no.PART 2, pp.81 - 93 | - |
dc.description.journalClass | 2 | - |
dc.identifier.scopusid | 2-s2.0-84884405875 | - |
dc.citation.conferenceDate | 2013-08-26 | - |
dc.citation.conferencePlace | Prague | - |
dc.citation.endPage | 93 | - |
dc.citation.number | PART 2 | - |
dc.citation.startPage | 81 | - |
dc.citation.title | 24th International Conference on Database and Expert Systems Applications, DEXA 2013 | - |
dc.citation.volume | 8056 LNCS | - |
dc.contributor.affiliatedAuthor | Kim, M. | - |
dc.contributor.affiliatedAuthor | Khil, A. | - |
dc.type.docType | Conference Paper | - |
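
The abstract describes classifying text blocks from features of each block and its neighbours. The following is a minimal sketch of that idea, not the authors' implementation: it splits HTML into text blocks, counts words and linked words per block, and builds a feature row from the previous, current, and next block. The names `Block`, `BlockParser`, `extract_blocks`, `context_features`, and the `BLOCK_TAGS` set are assumptions made for illustration, and a plain word count stands in for the paper's word density. HTML tag distribution and block distances are omitted.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; the paper may segment differently.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "section", "article"}


@dataclass
class Block:
    words: int = 0        # all words in the block
    link_words: int = 0   # words that appear inside <a> elements

    @property
    def link_density(self) -> float:
        # Fraction of the block's words that are anchor text.
        return self.link_words / self.words if self.words else 0.0


class BlockParser(HTMLParser):
    """Collects one Block per run of text between block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = Block()
        self._link_depth = 0

    def flush(self):
        # Close the current block, keeping it only if it holds text.
        if self._current.words:
            self.blocks.append(self._current)
        self._current = Block()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        n = len(data.split())
        self._current.words += n
        if self._link_depth:
            self._current.link_words += n


def extract_blocks(html: str) -> list:
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last block tag
    return parser.blocks


def context_features(blocks: list) -> list:
    """One row per block: (words, link density) of previous, current, next."""
    rows = []
    for i, cur in enumerate(blocks):
        prev = blocks[i - 1] if i > 0 else Block()
        nxt = blocks[i + 1] if i + 1 < len(blocks) else Block()
        rows.append([prev.words, prev.link_density,
                     cur.words, cur.link_density,
                     nxt.words, nxt.link_density])
    return rows
```

Rows produced this way could be passed, together with main-content labels, to any decision-tree learner. The specific learner and feature set used in the paper are not stated in this record.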