Finding seeds to bootstrap focused crawlers

Karane Vieira, Luciano Barbosa, Altigran Soares da Silva, Juliana Freire, Edleno Moura

Research output: Contribution to journal › Article

Abstract

Focused crawlers are effective tools for applications requiring a high number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that by providing crawlers a seed set that is large and varied, they not only obtain higher harvest rates but also an improved topic coverage.
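The abstract reports results in terms of harvest rate without defining it. As a minimal sketch, assuming the definition commonly used in the focused-crawling literature (relevant pages retrieved divided by total pages crawled), the metric could be computed as follows. The function name, the relevance predicate, and the toy data are hypothetical illustrations and are not taken from the paper.

```python
def harvest_rate(crawled_pages, is_relevant):
    """Return the fraction of crawled pages judged relevant.

    Assumed definition: relevant pages / total pages crawled.
    `is_relevant` stands in for whatever topic classifier a crawler uses.
    """
    pages = list(crawled_pages)
    if not pages:
        return 0.0  # an empty crawl has no relevant pages
    relevant = sum(1 for page in pages if is_relevant(page))
    return relevant / len(pages)


# Hypothetical toy crawl: page texts checked by a keyword-based predicate.
crawl = ["flights to Rome", "cheap flights", "cooking pasta", "flight deals"]
rate = harvest_rate(crawl, lambda text: "flight" in text)
```

Comparing this value across crawls started from different seed sets is one simple way to observe the kind of seed sensitivity the abstract describes.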

Original language: English (US)
Pages (from-to): 449-474
Number of pages: 26
Journal: World Wide Web
Volume: 19
Issue number: 3
DOI: https://doi.org/10.1007/s11280-015-0331-7
State: Published - May 1 2016

Keywords

  • Focused crawling
  • Relevance feedback
  • Web crawling

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software

Cite this

Vieira, K., Barbosa, L., da Silva, A. S., Freire, J., & Moura, E. (2016). Finding seeds to bootstrap focused crawlers. World Wide Web, 19(3), 449-474. https://doi.org/10.1007/s11280-015-0331-7

@article{8cad5d4bdc4346909ef36cd5378686d7,
title = "Finding seeds to bootstrap focused crawlers",
abstract = "Focused crawlers are effective tools for applications requiring a high number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that by providing crawlers a seed set that is large and varied, they not only obtain higher harvest rates but also an improved topic coverage.",
keywords = "Focused crawling, Relevance feedback, Web crawling",
author = "Karane Vieira and Luciano Barbosa and {da Silva}, {Altigran Soares} and Juliana Freire and Edleno Moura",
year = "2016",
month = "5",
day = "1",
doi = "10.1007/s11280-015-0331-7",
language = "English (US)",
volume = "19",
pages = "449--474",
journal = "World Wide Web",
issn = "1386-145X",
publisher = "Springer New York",
number = "3",

}
