An adaptive crawler for locating hidden-Web entry points

Luciano Barbosa, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
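
The idea of prioritizing promising links and adapting as the crawl progresses can be illustrated with a minimal sketch. This is not the authors' implementation: the `LinkScorer` class, the `crawl` function, the token-based link features, the smoothing rule, and the toy link graph are all assumptions made for illustration. The sketch also rewards only immediate hits, so it does not model links with delayed benefit. Harvest rate is taken here to mean forms found per page visited.

```python
import heapq
from collections import defaultdict


class LinkScorer:
    """Hypothetical online learner: tracks how often links containing
    a given token led directly to a page with a searchable form."""

    def __init__(self):
        self.hits = defaultdict(int)   # token -> links that led to a form
        self.seen = defaultdict(int)   # token -> links followed

    def score(self, tokens):
        # Mean of Laplace-smoothed success rates; 0.5 when nothing is known.
        if not tokens:
            return 0.5
        rates = [(self.hits[t] + 1) / (self.seen[t] + 2) for t in tokens]
        return sum(rates) / len(rates)

    def update(self, tokens, found_form):
        for t in tokens:
            self.seen[t] += 1
            if found_form:
                self.hits[t] += 1


def crawl(graph, form_pages, seed, budget):
    """Visit up to `budget` pages, always following the best-scored link.

    graph maps a page to a list of (target_page, link_tokens) pairs.
    Links are scored when queued and not rescored later (a simplification).
    Returns (forms found in visit order, harvest rate).
    """
    scorer = LinkScorer()
    # Heap entries: (negated score, unique tie-breaker, page, link tokens).
    frontier = [(-0.5, 0, seed, ())]
    counter = 1
    visited, forms = set(), []
    while frontier and len(visited) < budget:
        _, _, page, tokens = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        found = page in form_pages
        if found:
            forms.append(page)
        scorer.update(tokens, found)  # learn from the link just followed
        for target, link_tokens in graph.get(page, []):
            if target not in visited:
                entry = (-scorer.score(link_tokens), counter, target,
                         tuple(link_tokens))
                heapq.heappush(frontier, entry)
                counter += 1
    return forms, len(forms) / len(visited)


# Toy link graph (made up for this example).
toy_graph = {
    "home": [("search1", ["search"]), ("about", ["about"]), ("news", ["news"])],
    "search1": [("search2", ["search"]), ("contact", ["about"])],
    "news": [("news2", ["news"])],
}
toy_forms = {"search1", "search2"}

found_forms, harvest_rate = crawl(toy_graph, toy_forms, "home", budget=3)
print(found_forms, round(harvest_rate, 2))
```

After the first successful "search" link, the scorer raises the priority of other links carrying the same token, so the crawl budget is spent where forms have already been found. A fixed-focus crawler would instead keep its initial link ordering throughout the crawl.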

Original language: English (US)
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 441-450
Number of pages: 10
DOI: https://doi.org/10.1145/1242572.1242632
State: Published - 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: May 8 2007 to May 12 2007

Other

Other: 16th International World Wide Web Conference, WWW2007
Country: Canada
City: Banff, AB
Period: 5/8/07 to 5/12/07

Fingerprint

Websites
Experiments

Keywords

  • Hidden Web
  • Learning classifiers
  • Online learning
  • Web crawling strategies

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Barbosa, L., & Freire, J. (2007). An adaptive crawler for locating hidden-Web entry points. In 16th International World Wide Web Conference, WWW2007 (pp. 441-450). https://doi.org/10.1145/1242572.1242632

@inproceedings{df4a1308c98c4eb482589772f6419934,
title = "An adaptive crawler for locating hidden-Web entry points",
abstract = "In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.",
keywords = "Hidden Web, Learning classifiers, Online learning, Web crawling strategies",
author = "Luciano Barbosa and Juliana Freire",
year = "2007",
doi = "10.1145/1242572.1242632",
language = "English (US)",
isbn = "1595936548",
pages = "441--450",
booktitle = "16th International World Wide Web Conference, WWW2007",

}

TY - GEN

T1 - An adaptive crawler for locating hidden-Web entry points

AU - Barbosa, Luciano

AU - Freire, Juliana

PY - 2007

Y1 - 2007

AB - In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.

KW - Hidden Web

KW - Learning classifiers

KW - Online learning

KW - Web crawling strategies

UR - http://www.scopus.com/inward/record.url?scp=35348920123&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35348920123&partnerID=8YFLogxK

U2 - 10.1145/1242572.1242632

DO - 10.1145/1242572.1242632

M3 - Conference contribution

AN - SCOPUS:35348920123

SN - 1595936548

SN - 9781595936547

SP - 441

EP - 450

BT - 16th International World Wide Web Conference, WWW2007

ER -