Learning to discover domain-specific web content

Kien Pham, Aécio Santos, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage compared to existing, state-of-the-art techniques. In addition, it is also able to capture 80% of new relevant content within less than 4 hours of publication.

Original language: English (US)
Title of host publication: WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining
Publisher: Association for Computing Machinery, Inc
Pages: 432-440
Number of pages: 9
Volume: 2018-February
ISBN (Electronic): 9781450355810
DOI: https://doi.org/10.1145/3159652.3159724
State: Published - Feb 2 2018
Event: 11th ACM International Conference on Web Search and Data Mining, WSDM 2018 - Marina Del Rey, United States
Duration: Feb 5 2018 to Feb 9 2018

Other

Other: 11th ACM International Conference on Web Search and Data Mining, WSDM 2018
Country: United States
City: Marina Del Rey
Period: 2/5/18 to 2/9/18

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Computer Networks and Communications
  • Information Systems

Cite this

Pham, K., Santos, A., & Freire, J. (2018). Learning to discover domain-specific web content. In WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining (Vol. 2018-February, pp. 432-440). Association for Computing Machinery, Inc. https://doi.org/10.1145/3159652.3159724

Learning to discover domain-specific web content. / Pham, Kien; Santos, Aécio; Freire, Juliana.

WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining. Vol. 2018-February. Association for Computing Machinery, Inc, 2018. p. 432-440.

Pham, K, Santos, A & Freire, J 2018, Learning to discover domain-specific web content. in WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining. vol. 2018-February, Association for Computing Machinery, Inc, pp. 432-440, 11th ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, United States, 2/5/18. https://doi.org/10.1145/3159652.3159724
Pham K, Santos A, Freire J. Learning to discover domain-specific web content. In WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining. Vol. 2018-February. Association for Computing Machinery, Inc. 2018. p. 432-440. https://doi.org/10.1145/3159652.3159724
Pham, Kien ; Santos, Aécio ; Freire, Juliana. / Learning to discover domain-specific web content. WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining. Vol. 2018-February. Association for Computing Machinery, Inc, 2018. pp. 432-440
@inproceedings{2cb2a0fc098a4baaa8feeffa097f2479,
title = "Learning to discover domain-specific web content",
abstract = "The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150{\%} higher coverage compared to existing, state-of-the-art techniques. In addition, it is also able to capture 80{\%} of new relevant content within less than 4 hours of publication.",
author = "Kien Pham and A{\'e}cio Santos and Juliana Freire",
year = "2018",
month = "2",
day = "2",
doi = "10.1145/3159652.3159724",
language = "English (US)",
volume = "2018-February",
pages = "432--440",
booktitle = "WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining",
publisher = "Association for Computing Machinery, Inc",
}

TY - GEN

T1 - Learning to discover domain-specific web content

AU - Pham, Kien

AU - Santos, Aécio

AU - Freire, Juliana

PY - 2018/2/2

Y1 - 2018/2/2

N2 - The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage compared to existing, state-of-the-art techniques. In addition, it is also able to capture 80% of new relevant content within less than 4 hours of publication.

AB - The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage compared to existing, state-of-the-art techniques. In addition, it is also able to capture 80% of new relevant content within less than 4 hours of publication.

UR - http://www.scopus.com/inward/record.url?scp=85046901390&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85046901390&partnerID=8YFLogxK

U2 - 10.1145/3159652.3159724

DO - 10.1145/3159652.3159724

M3 - Conference contribution

VL - 2018-February

SP - 432

EP - 440

BT - WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining

PB - Association for Computing Machinery, Inc

ER -