Inverted index compression and query processing with optimized document ordering

Hao Yan, Shuai Ding, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    Web search engines use highly optimized compression schemes to decrease inverted index size and improve query through-put, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

    Original languageEnglish (US)
    Title of host publicationWWW'09 - Proceedings of the 18th International World Wide Web Conference
    Pages401-410
    Number of pages10
    DOIs
    StatePublished - 2009
    Event18th International World Wide Web Conference, WWW 2009 - Madrid, Spain
    Duration: Apr 20 2009Apr 24 2009

    Other

    Other18th International World Wide Web Conference, WWW 2009
    CountrySpain
    CityMadrid
    Period4/20/094/24/09

    Fingerprint

    Query processing
    Search engines
    World Wide Web
    Websites
    Experiments

    Keywords

    • Document ordering
    • Index compression
    • Inverted index
    • IR query processing
    • Search engines

    ASJC Scopus subject areas

    • Computer Networks and Communications

    Cite this

    Yan, H., Ding, S., & Suel, T. (2009). Inverted index compression and query processing with optimized document ordering. In WWW'09 - Proceedings of the 18th International World Wide Web Conference (pp. 401-410) https://doi.org/10.1145/1526709.1526764

    Inverted index compression and query processing with optimized document ordering. / Yan, Hao; Ding, Shuai; Suel, Torsten.

    WWW'09 - Proceedings of the 18th International World Wide Web Conference. 2009. p. 401-410.

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Yan, H, Ding, S & Suel, T 2009, Inverted index compression and query processing with optimized document ordering. in WWW'09 - Proceedings of the 18th International World Wide Web Conference. pp. 401-410, 18th International World Wide Web Conference, WWW 2009, Madrid, Spain, 4/20/09. https://doi.org/10.1145/1526709.1526764
    Yan H, Ding S, Suel T. Inverted index compression and query processing with optimized document ordering. In WWW'09 - Proceedings of the 18th International World Wide Web Conference. 2009. p. 401-410 https://doi.org/10.1145/1526709.1526764
    Yan, Hao ; Ding, Shuai ; Suel, Torsten. / Inverted index compression and query processing with optimized document ordering. WWW'09 - Proceedings of the 18th International World Wide Web Conference. 2009. pp. 401-410
    @inproceedings{b8029409fa254647a2dd2fa61891fc3c,
    title = "Inverted index compression and query processing with optimized document ordering",
    abstract = "Web search engines use highly optimized compression schemes to decrease inverted index size and improve query through-put, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages. Copyright is held by the International World Wide Web Conference Committee (IW3C2).",
    keywords = "Document ordering, Index compression, Inverted index, IR query processing, Search engines",
    author = "Hao Yan and Shuai Ding and Torsten Suel",
    year = "2009",
    doi = "10.1145/1526709.1526764",
    language = "English (US)",
    isbn = "9781605584874",
    pages = "401--410",
    booktitle = "WWW'09 - Proceedings of the 18th International World Wide Web Conference",

    }

    TY - GEN

    T1 - Inverted index compression and query processing with optimized document ordering

    AU - Yan, Hao

    AU - Ding, Shuai

    AU - Suel, Torsten

    PY - 2009

    Y1 - 2009

    N2 - Web search engines use highly optimized compression schemes to decrease inverted index size and improve query through-put, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

    AB - Web search engines use highly optimized compression schemes to decrease inverted index size and improve query through-put, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [7, 23, 25, 6, 24] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

    KW - Document ordering

    KW - Index compression

    KW - Inverted index

    KW - IR query processing

    KW - Search engines

    UR - http://www.scopus.com/inward/record.url?scp=84865646680&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84865646680&partnerID=8YFLogxK

    U2 - 10.1145/1526709.1526764

    DO - 10.1145/1526709.1526764

    M3 - Conference contribution

    AN - SCOPUS:84865646680

    SN - 9781605584874

    SP - 401

    EP - 410

    BT - WWW'09 - Proceedings of the 18th International World Wide Web Conference

    ER -