Understanding website behavior based on user agent

Kien Pham, Aécio Santos, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important for designing crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on the sites' responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.
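
The probing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the two user-agent strings, the `ExampleBot` name, and the helpers `fetch`, `probe`, and `differs_by_agent` are hypothetical, and the sketch omits the TOR proxy and the IP-address dimension of the study. It requests the same URL once per user-agent and reports whether the status code or body hash differs.

```python
import hashlib
import urllib.error
import urllib.request
from typing import Callable, Dict, Tuple

# Illustrative user-agent strings (assumptions; the six agents used in the
# study are not listed in this record).
USER_AGENTS: Dict[str, str] = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/50.0 Safari/537.36",
    "crawler": "ExampleBot/1.0 (+http://example.com/bot)",
}

Response = Tuple[int, bytes]  # (HTTP status code, response body)


def fetch(url: str, user_agent: str, timeout: float = 10.0) -> Response:
    """Issue one HTTP GET with the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as reply:
            return reply.status, reply.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx replies are still observations, e.g. a 403 served to bots.
        return err.code, err.read()


def probe(
    url: str,
    fetcher: Callable[[str, str], Response] = fetch,
    agents: Dict[str, str] = USER_AGENTS,
) -> Dict[str, Tuple[int, str]]:
    """Map each agent label to (status code, SHA-256 of the body)."""
    results = {}
    for label, user_agent in agents.items():
        status, body = fetcher(url, user_agent)
        results[label] = (status, hashlib.sha256(body).hexdigest())
    return results


def differs_by_agent(results: Dict[str, Tuple[int, str]]) -> bool:
    """True if at least two agents saw a different status or body."""
    return len(set(results.values())) > 1
```

Calling `differs_by_agent(probe(url))` gives a first signal only. Pages with dynamic content can produce different hashes even for identical requests, so a practical check would repeat each request or compare coarser features such as status code and content length. Connection failures (`urllib.error.URLError`) are not handled here.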

Original language: English (US)
Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 1053-1056
Number of pages: 4
ISBN (Electronic): 9781450342902
DOI: https://doi.org/10.1145/2911451.2914757
State: Published - Jul 7 2016
Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
Duration: Jul 17 2016 to Jul 21 2016

Other

Other: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Country: Italy
City: Pisa
Period: 7/17/16 to 7/21/16

Fingerprint

Websites
HTTP
Costs

Keywords

  • Stealth crawling
  • User-agent string
  • Web cloaking
  • Web crawler detection

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Pham, K., Santos, A., & Freire, J. (2016). Understanding website behavior based on user agent. In SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1053-1056). Association for Computing Machinery, Inc. https://doi.org/10.1145/2911451.2914757

Understanding website behavior based on user agent. / Pham, Kien; Santos, Aécio; Freire, Juliana.

SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, 2016. p. 1053-1056.

Pham, K, Santos, A & Freire, J 2016, Understanding website behavior based on user agent. in SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, pp. 1053-1056, 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, 7/17/16. https://doi.org/10.1145/2911451.2914757
Pham K, Santos A, Freire J. Understanding website behavior based on user agent. In SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc. 2016. p. 1053-1056 https://doi.org/10.1145/2911451.2914757
Pham, Kien ; Santos, Aécio ; Freire, Juliana. / Understanding website behavior based on user agent. SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, 2016. pp. 1053-1056
@inproceedings{693a6828d58942bc97d3b08d7d74c941,
title = "Understanding website behavior based on user agent",
abstract = "Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important for designing crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on the sites' responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.",
keywords = "Stealth crawling, User-agent string, Web cloaking, Web crawler detection",
author = "Kien Pham and A{\'e}cio Santos and Juliana Freire",
year = "2016",
month = "7",
day = "7",
doi = "10.1145/2911451.2914757",
language = "English (US)",
pages = "1053--1056",
booktitle = "SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval",
publisher = "Association for Computing Machinery, Inc",
}

TY - GEN

T1 - Understanding website behavior based on user agent

AU - Pham, Kien

AU - Santos, Aécio

AU - Freire, Juliana

PY - 2016/7/7

Y1 - 2016/7/7

N2 - Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important for designing crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on the sites' responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.

KW - Stealth crawling

KW - User-agent string

KW - Web cloaking

KW - Web crawler detection

UR - http://www.scopus.com/inward/record.url?scp=84980349497&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84980349497&partnerID=8YFLogxK

U2 - 10.1145/2911451.2914757

DO - 10.1145/2911451.2914757

M3 - Conference contribution

SP - 1053

EP - 1056

BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - Association for Computing Machinery, Inc

ER -