Tractable near-optimal policies for crawling

Yossi Azar, Eric Horvitz, Eyal Lubetzky, Yuval Peres, Dafna Shahaf

Research output: Contribution to journal (Article)

Abstract

The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.
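
The abstract does not state the objective function, so the following is only a minimal sketch under an assumed model, not the authors' algorithm. It assumes page i is requested at rate mu[i], changes at rate delta[i] > 0, and that a randomized policy polling page i with probability p[i] keeps it fresh a fraction p[i] / (p[i] + delta[i]) of the time. Under that assumption, maximizing the sum of mu[i] * p[i] / (p[i] + delta[i]) subject to the p[i] summing to 1 can be done with one sort and one scan, which is consistent with the O(n log n) bound mentioned above. The names optimal_randomized_probs and expected_utility are hypothetical.

```python
import math


def optimal_randomized_probs(mu, delta):
    """Polling probabilities maximizing sum(mu[i] * p[i] / (p[i] + delta[i]))
    subject to sum(p) == 1 and p[i] >= 0.

    ASSUMED model (not taken from the abstract): mu[i] is the request rate and
    delta[i] > 0 the change rate of page i.
    """
    n = len(mu)
    # Pages with a high mu/delta ratio gain the most from the first bit of polling.
    order = sorted(range(n), key=lambda i: mu[i] / delta[i], reverse=True)

    a = 0.0  # running sum of sqrt(mu[i] * delta[i]) over the active pages
    b = 0.0  # running sum of delta[i] over the active pages
    active = []
    for i in order:
        r = math.sqrt(mu[i] / delta[i])
        # Page i would get a non-positive probability: stop. Every later page
        # has a smaller ratio, so it is excluded as well.
        if r * (1.0 + b) <= a:
            break
        a += math.sqrt(mu[i] * delta[i])
        b += delta[i]
        active.append(i)

    p = [0.0] * n
    if active:
        s = a / (1.0 + b)  # square root of the Lagrange multiplier
        for i in active:
            p[i] = math.sqrt(mu[i] * delta[i]) / s - delta[i]
    return p


def expected_utility(mu, delta, p):
    """Value of the assumed objective for a given probability vector."""
    return sum(m * q / (q + d) for m, d, q in zip(mu, delta, p))


if __name__ == "__main__":
    mu, delta = [3.0, 2.0, 1.0], [1.0, 2.0, 3.0]
    p = optimal_randomized_probs(mu, delta)
    print(p, expected_utility(mu, delta, p))
```

In this sketch a page that changes much faster than it is requested can receive probability zero, since polling it would rarely leave it fresh for long. Turning the probabilities into a deterministic schedule, as the abstract describes, is not covered here.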

Original language: English (US)
Pages (from-to): 8099-8103
Number of pages: 5
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 115
Issue number: 32
DOI: 10.1073/pnas.1801519115
State: Published - Aug 7 2018

Keywords

  • Caching policies
  • Scheduling optimization
  • Web crawling

ASJC Scopus subject areas

  • General

Cite this

Tractable near-optimal policies for crawling. / Azar, Yossi; Horvitz, Eric; Lubetzky, Eyal; Peres, Yuval; Shahaf, Dafna.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 115, No. 32, 07.08.2018, p. 8099-8103.

Azar, Yossi ; Horvitz, Eric ; Lubetzky, Eyal ; Peres, Yuval ; Shahaf, Dafna. / Tractable near-optimal policies for crawling. In: Proceedings of the National Academy of Sciences of the United States of America. 2018 ; Vol. 115, No. 32. pp. 8099-8103.
@article{b2e65044c3cb434f940e16b700b4af98,
title = "Tractable near-optimal policies for crawling",
abstract = "The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99{\%} of the optimum.",
keywords = "Caching policies, Scheduling optimization, Web crawling",
author = "Yossi Azar and Eric Horvitz and Eyal Lubetzky and Yuval Peres and Dafna Shahaf",
year = "2018",
month = "8",
day = "7",
doi = "10.1073/pnas.1801519115",
language = "English (US)",
volume = "115",
pages = "8099--8103",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "32",

}

TY - JOUR

T1 - Tractable near-optimal policies for crawling

AU - Azar, Yossi

AU - Horvitz, Eric

AU - Lubetzky, Eyal

AU - Peres, Yuval

AU - Shahaf, Dafna

PY - 2018/8/7

Y1 - 2018/8/7

AB - The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.

KW - Caching policies

KW - Scheduling optimization

KW - Web crawling

UR - http://www.scopus.com/inward/record.url?scp=85054929704&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054929704&partnerID=8YFLogxK

U2 - 10.1073/pnas.1801519115

DO - 10.1073/pnas.1801519115

M3 - Article

AN - SCOPUS:85054929704

VL - 115

SP - 8099

EP - 8103

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 32

ER -