Design and implementation of a high-performance distributed web crawler

Vladislav Shkapenyuk, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.
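    To make the throughput and politeness concerns mentioned in the abstract concrete, the sketch below is a minimal, hypothetical multi-threaded fetcher in Python, not the authors' system: worker threads pull URLs from a shared frontier, honor a per-host delay so no single server is overloaded, and download pages concurrently to hide network latency. Only Python standard-library modules are used; the seed URLs, thread count, delay, and page limit are arbitrary placeholders, and the whole design is a simplified illustration under those assumptions.

        import queue
        import threading
        import time
        import urllib.request
        from urllib.parse import urlparse

        FRONTIER = queue.Queue()      # shared queue of URLs waiting to be fetched
        NEXT_SLOT = {}                # host -> earliest time the next request may go out
        LOCK = threading.Lock()
        POLITENESS_DELAY = 2.0        # assumed seconds between requests to the same host
        MAX_PAGES = 20                # stop after roughly this many downloads (assumed)
        pages_done = 0

        def worker():
            """Repeatedly take a URL from the frontier, wait for its host's slot, fetch it."""
            global pages_done
            while True:
                try:
                    url = FRONTIER.get(timeout=5)   # exit once the frontier stays empty
                except queue.Empty:
                    return
                host = urlparse(url).netloc
                with LOCK:
                    # Reserve a fetch slot POLITENESS_DELAY after this host's previous one.
                    slot = max(time.time(), NEXT_SLOT.get(host, 0.0))
                    NEXT_SLOT[host] = slot + POLITENESS_DELAY
                time.sleep(max(0.0, slot - time.time()))
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        body = resp.read()
                    with LOCK:
                        pages_done += 1
                        print(f"[{pages_done:3}] {len(body):8} bytes  {url}")
                        if pages_done >= MAX_PAGES:
                            return
                except Exception as exc:
                    print(f"failed {url}: {exc}")
                finally:
                    FRONTIER.task_done()

        if __name__ == "__main__":
            # Seed URLs are placeholders; a real crawler would also parse fetched pages,
            # extract links, deduplicate them, and push new URLs back onto the frontier.
            for seed in ("http://example.com/", "http://example.org/"):
                FRONTIER.put(seed)
            threads = [threading.Thread(target=worker, daemon=True) for _ in range(8)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()

    A crawler of the scale described in the paper would add, on top of this, link extraction and URL deduplication, DNS caching, checkpointing so a crash does not lose the crawl state, and distribution of the workload across multiple machines.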

    Original language: English (US)
    Title of host publication: Proceedings - International Conference on Data Engineering
    Editors: R. Agrawal, K. Dittrich, A. Ngu
    Pages: 357-368
    Number of pages: 12
    State: Published - 2002
    Event: 18th International Conference on Data Engineering - San Jose, CA, United States
    Duration: Feb 26, 2002 - Mar 1, 2002

    Fingerprint

    Software architecture
    Search engines
    Network performance
    Costs
    Web crawler

    ASJC Scopus subject areas

    • Software
    • Engineering (all)
    • Engineering (miscellaneous)

    Cite this

    Shkapenyuk, V., & Suel, T. (2002). Design and implementation of a high-performance distributed web crawler. In R. Agrawal, K. Dittrich, & A. Ngu (Eds.), Proceedings - International Conference on Data Engineering (pp. 357-368).
