Detailed Information

Cited 0 times in Web of Science; cited 0 times in Scopus

Implementation of Efficient Distributed Crawler through Stepwise Crawling Node Allocation

Other Titles
Implementation of Efficient Distributed Crawler through Stepwise Crawling Node Allocation
Authors
김현태; 변준형; 나요셉; 정유철
Issue Date
2020
Publisher
한국정보기술학회 (Korean Institute of Information Technology)
Keywords
Web crawling; docker swarm; virtual nodes; documents; scrapy; efficiency
Citation
한국정보기술학회 영문논문지, v.10, no.2, pp. 15-31
Journal Title
한국정보기술학회 영문논문지
Volume
10
Number
2
Start Page
15
End Page
31
URI
https://scholarworks.bwise.kr/kumoh/handle/2020.sw.kumoh/18544
ISSN
2234-1072
Abstract
Various websites have been created due to the increased use of the Internet, and the number of documents distributed through these websites has increased proportionally. However, it is not easy to collect newly updated documents rapidly. Web crawling methods have been used to continuously collect and manage new documents, whereas existing crawling systems applying a single node demonstrate limited performance. Furthermore, crawlers applying distribution methods exhibit a problem related to effective node management for crawling. This study proposes an efficient distributed crawler through stepwise crawling node allocation, which identifies websites' properties and establishes crawling policies based on the identified properties to collect a large number of documents from multiple websites. The proposed crawler can estimate the number of documents included in a website, compare the data collection time and the amount of data collected for different numbers of nodes allocated to a specific website by repeatedly visiting the website, and automatically allocate the optimal number of nodes to each website for crawling. An experiment is conducted in which the proposed and single-node methods are applied to 12 different websites; the experimental result indicates that the proposed crawler's data collection time decreased significantly compared with that of a single-node crawler. This result is obtained because the proposed crawler applies per-website data collection policies. In addition, it is confirmed that the work rate of the proposed model increased.
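The abstract's core idea — repeatedly probing a site with different node counts and stopping once adding nodes no longer yields a meaningful speed-up — can be sketched in Python. This is a minimal illustration only, not the paper's actual implementation: the function names, the toy timing model in `measure_collection_time`, and all parameter values are hypothetical assumptions, standing in for the real measurements the crawler would take by visiting each site.

```python
# Hedged sketch of stepwise crawling node allocation.
# Assumption: collection time for a site scales roughly as
# (documents * per-document cost) / nodes, plus a per-node
# coordination overhead. The real system measures this empirically.

def measure_collection_time(num_docs, nodes, per_doc_cost=0.01, overhead=2.0):
    """Toy timing model (hypothetical): crawl work is divided across
    nodes, but each extra node adds fixed coordination overhead."""
    return num_docs * per_doc_cost / nodes + overhead * nodes

def allocate_nodes(num_docs, max_nodes=16, min_gain=0.05):
    """Stepwise allocation: keep adding nodes while the measured
    collection time improves by at least `min_gain` (5% here);
    stop at the first step that fails to clear that margin."""
    best_n = 1
    best_t = measure_collection_time(num_docs, best_n)
    for n in range(2, max_nodes + 1):
        t = measure_collection_time(num_docs, n)
        if t < best_t * (1 - min_gain):
            best_t, best_n = t, n
        else:
            break  # diminishing returns: extra nodes no longer pay off
    return best_n

if __name__ == "__main__":
    # A document-rich site warrants more nodes than a small one.
    print(allocate_nodes(10_000))  # larger site
    print(allocate_nodes(1_000))   # smaller site
```

Under this toy model, a site with ten times as many documents is automatically assigned more crawling nodes, which mirrors the per-website allocation policy the abstract describes.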
Files in This Item
There are no files associated with this item.
Appears in
Collections
Department of Computer Engineering > 1. Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

JUNG, YU CHUL
College of Engineering (Department of Computer Engineering)
