Scalable techniques for document identifier assignment in inverted indexes

Shuai Ding, Josh Attenberg, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
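
To make the link between identifier assignment and index size concrete, here is a minimal sketch. It is not the authors' algorithm or their evaluation setup. It assumes posting lists are stored as gaps between consecutive document identifiers and that each gap is written with an Elias gamma code; the abstract does not name a codec. The toy collection, the term names, and the helpers `gamma_bits` and `index_bits` are all hypothetical.

```python
def gamma_bits(gap: int) -> int:
    """Bits used by an Elias gamma code for a positive integer gap."""
    return 2 * (gap.bit_length() - 1) + 1


def index_bits(postings, doc_id_of):
    """Total gamma-coded size of all posting lists after gap encoding.

    postings maps term -> set of document names (hypothetical toy data);
    doc_id_of maps document name -> assigned integer docID.
    """
    total = 0
    for docs in postings.values():
        prev = -1  # so that the first gap equals docID + 1, which is >= 1
        for doc_id in sorted(doc_id_of[d] for d in docs):
            total += gamma_bits(doc_id - prev)
            prev = doc_id
    return total


# Toy collection: two groups of similar documents, one shared term per group.
postings = {
    "alpha": {"a0", "a1", "a2", "a3"},
    "beta": {"b0", "b1", "b2", "b3"},
}

# Assignment 1: similar documents are scattered (a0, b0, a1, b1, ...).
interleaved = {name: i for i, name in enumerate(
    ["a0", "b0", "a1", "b1", "a2", "b2", "a3", "b3"])}

# Assignment 2: similar documents receive consecutive identifiers.
clustered = {name: i for i, name in enumerate(
    ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"])}

interleaved_size = index_bits(postings, interleaved)
clustered_size = index_bits(postings, clustered)
print(interleaved_size, clustered_size)  # prints "22 12"
```

In this toy case, giving consecutive identifiers to documents that share terms shrinks the gaps and so the encoded size. The approaches named in the abstract (URL sorting, Traveling Salesman tours, graph partitioning) can be read as different ways of finding such an ordering at scale.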

    Original language: English (US)
    Title of host publication: Proceedings of the 19th International Conference on World Wide Web, WWW '10
    Pages: 311-320
    Number of pages: 10
    DOIs: https://doi.org/10.1145/1772690.1772723
    State: Published - 2010
    Event: 19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States
    Duration: Apr 26 2010 to Apr 30 2010

    Fingerprint

    Traveling salesman problem
    Query processing
    Search engines
    Sorting
    Data structures
    Websites

    Keywords

    • documentID reassignment
    • index compression

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Computer Science Applications

    Cite this

    Ding, S., Attenberg, J., & Suel, T. (2010). Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th International Conference on World Wide Web, WWW '10 (pp. 311-320). https://doi.org/10.1145/1772690.1772723

    @inproceedings{927b6b509b0f472187fb0522fdd32bbb,
    title = "Scalable techniques for document identifier assignment in inverted indexes",
    abstract = "Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.",
    keywords = "documentID reassignment, index compression",
    author = "Shuai Ding and Josh Attenberg and Torsten Suel",
    year = "2010",
    doi = "10.1145/1772690.1772723",
    language = "English (US)",
    isbn = "9781605587998",
    pages = "311--320",
    booktitle = "Proceedings of the 19th International Conference on World Wide Web, WWW '10",

    }

    TY - GEN

    T1 - Scalable techniques for document identifier assignment in inverted indexes

    AU - Ding, Shuai

    AU - Attenberg, Josh

    AU - Suel, Torsten

    PY - 2010

    Y1 - 2010

    AB - Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.

    KW - documentID reassignment

    KW - index compression

    UR - http://www.scopus.com/inward/record.url?scp=77954565988&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=77954565988&partnerID=8YFLogxK

    U2 - 10.1145/1772690.1772723

    DO - 10.1145/1772690.1772723

    M3 - Conference contribution

    SN - 9781605587998

    SP - 311

    EP - 320

    BT - Proceedings of the 19th International Conference on World Wide Web, WWW '10

    ER -