Scheduling Web Crawl for Better Performance and Quality

Report ID:
September 2003
A Web crawler is an essential component of search engines, data
mining, and other Internet applications. Scheduling Web pages to be
downloaded is an important aspect of crawling. Previous research on Web
crawling has focused on optimizing either crawl speed or the quality of the
Web pages downloaded. While both metrics are important, scheduling by either
one alone is insufficient and can bias or hurt the overall crawl process. This
paper explores the design space of crawl scheduling to balance performance
and quality factors and optimize the global crawl efficiency. We design a
network-efficient scheduling framework and use it to evaluate various
scheduling strategies. We also define a new scheduling algorithm that
factors both network performance and Web page quality into scheduling
decisions. Real-world experiments clearly demonstrate the
effectiveness of the two-level scheduling scheme and the new algorithm in
improving overall crawl efficiency. Experiments also show that
crawl-scheduling design can always be optimized further, given a full
understanding of application properties.

