Finding seeds to bootstrap focused crawlers

Karane Vieira, Luciano Barbosa, Altigran Soares da Silva, Juliana Freire, Edleno Moura

Research output: Contribution to journal › Article

Abstract

Focused crawlers are effective tools for applications requiring a large number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that crawlers provided with a large and varied seed set obtain not only higher harvest rates but also improved topic coverage.

Original language: English (US)
Pages (from-to): 449-474
Number of pages: 26
Journal: World Wide Web
Volume: 19
Issue number: 3
DOIs: https://doi.org/10.1007/s11280-015-0331-7
State: Published - May 1 2016

Keywords

  • Focused crawling
  • Relevance feedback
  • Web crawling

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

    Vieira, K., Barbosa, L., da Silva, A. S., Freire, J., & Moura, E. (2016). Finding seeds to bootstrap focused crawlers. World Wide Web, 19(3), 449-474. https://doi.org/10.1007/s11280-015-0331-7