Scalable manipulation of archival web graphs

Yasemin Avcular, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    In this paper, we study efficient ways to construct, represent, and analyze large-scale archival web graphs. We first discuss the details of a distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. In designing this representation, we consider both offline and online algorithms for graph analysis. Offline algorithms, such as PageRank, can use MapReduce and similar large-scale distributed frameworks for computation. Online algorithms, on the other hand, can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or the Scalable Hyperlink Store by Najork) to perform their computations. We also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline essential API calls for a real-time access repository specific to archival web graphs. Finally, we conclude with a discussion of ideas for archival web graph analysis that could reveal novel patterns useful for designing state-of-the-art compression techniques.
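
As a rough illustration of the offline case mentioned in the abstract, the sketch below expresses PageRank as repeated map and reduce steps. It is not the authors' implementation and does not use Hadoop; it runs in a single process, and the function names (`map_phase`, `reduce_phase`, `pagerank`), the damping value 0.85, and the even redistribution of rank from pages without out-links are all assumptions made for this example. It also assumes every link target appears as a key in the graph.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Map step: each page emits (target, share) pairs for its out-links."""
    for page, out_links in graph.items():
        if out_links:
            share = ranks[page] / len(out_links)
            for target in out_links:
                yield target, share
        else:
            # Assumption: a page with no out-links spreads its rank evenly
            # over all pages so that total rank is preserved.
            share = ranks[page] / len(graph)
            for target in graph:
                yield target, share


def reduce_phase(pairs, pages, damping=0.85):
    """Reduce step: sum the shares per target and apply the damping factor."""
    totals = defaultdict(float)
    for target, share in pairs:
        totals[target] += share
    n = len(pages)
    return {p: (1.0 - damping) / n + damping * totals[p] for p in pages}


def pagerank(graph, iterations=50, damping=0.85):
    """Run a fixed number of map/reduce rounds from a uniform start."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(graph, ranks), graph, damping)
    return ranks


if __name__ == "__main__":
    # Hypothetical four-page graph: adjacency lists keyed by page id.
    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    for page, rank in sorted(pagerank(example).items()):
        print(page, round(rank, 4))
```

In a distributed setting, the map step would typically run over partitions of the adjacency lists and the reduce step would group contributions by target page; the single-process version above only mirrors that data flow.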

    Original language: English (US)
    Title of host publication: CIKM 2011 Glasgow: LSDS-IR'11 - Proceedings of the 9th Workshop on Large-Scale and Distributed Informational Retrieval
    Pages: 27-32
    Number of pages: 6
    DOI: https://doi.org/10.1145/2064730.2064739
    State: Published - 2011
    Event: 9th Workshop on Large-Scale and Distributed Systems for Information Retrieval, LSDS-IR'11 - Glasgow, United Kingdom
    Duration: Oct 28 2011 to Oct 28 2011

    Other

    Other: 9th Workshop on Large-Scale and Distributed Systems for Information Retrieval, LSDS-IR'11
    Country: United Kingdom
    City: Glasgow
    Period: 10/28/11 to 10/28/11

    Fingerprint

    Manipulation
    Graph
    World Wide Web
    MapReduce
    Online algorithms
    Repository
    PageRank
    Connectivity
    Compression

    Keywords

    • archival web graphs
    • hadoop
    • mapreduce

    ASJC Scopus subject areas

    • Business, Management and Accounting (all)
    • Decision Sciences (all)

    Cite this

    Avcular, Y., & Suel, T. (2011). Scalable manipulation of archival web graphs. In CIKM 2011 Glasgow: LSDS-IR'11 - Proceedings of the 9th Workshop on Large-Scale and Distributed Informational Retrieval (pp. 27-32) https://doi.org/10.1145/2064730.2064739

    @inproceedings{3d4457eb7836414cb92dfec8c27ee391,
    title = "Scalable manipulation of archival web graphs",
    keywords = "archival web graphs, hadoop, mapreduce",
    author = "Yasemin Avcular and Torsten Suel",
    year = "2011",
    doi = "10.1145/2064730.2064739",
    language = "English (US)",
    isbn = "9781450309592",
    pages = "27--32",
    booktitle = "CIKM 2011 Glasgow: LSDS-IR'11 - Proceedings of the 9th Workshop on Large-Scale and Distributed Informational Retrieval",

    }

    TY - GEN

    T1 - Scalable manipulation of archival web graphs

    AU - Avcular, Yasemin

    AU - Suel, Torsten

    PY - 2011

    Y1 - 2011

    KW - archival web graphs

    KW - hadoop

    KW - mapreduce

    UR - http://www.scopus.com/inward/record.url?scp=83255193379&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=83255193379&partnerID=8YFLogxK

    U2 - 10.1145/2064730.2064739

    DO - 10.1145/2064730.2064739

    M3 - Conference contribution

    AN - SCOPUS:83255193379

    SN - 9781450309592

    SP - 27

    EP - 32

    BT - CIKM 2011 Glasgow: LSDS-IR'11 - Proceedings of the 9th Workshop on Large-Scale and Distributed Informational Retrieval

    ER -