Faster top-k document retrieval using block-max indexes

Shuai Ding, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.
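The block-max idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, only a simplified doc-at-a-time filter under stated assumptions: postings are uncompressed `(doc_id, score)` pairs with non-negative integer impact scores, and the names `BlockMaxList`, `build_block_max`, `block_bound`, and `top_k` are hypothetical. A real implementation would use the per-block maxima to skip whole compressed blocks without decoding them; here they only avoid scoring documents whose upper bound cannot beat the current top-k threshold.

```python
import bisect
import heapq
from dataclasses import dataclass

@dataclass
class BlockMaxList:
    doc_ids: list     # sorted document IDs of one term's inverted list
    scores: list      # impact score of the term in each document
    block_last: list  # last document ID in each block
    block_max: list   # maximum impact score inside each block

def build_block_max(postings, block_size=2):
    """Build per-block metadata from (doc_id, score) pairs sorted by doc_id."""
    doc_ids = [d for d, _ in postings]
    scores = [s for _, s in postings]
    block_last, block_max = [], []
    for start in range(0, len(postings), block_size):
        end = min(start + block_size, len(postings))
        block_last.append(doc_ids[end - 1])
        block_max.append(max(scores[start:end]))
    return BlockMaxList(doc_ids, scores, block_last, block_max)

def block_bound(lst, doc_id):
    """Upper bound on this term's score for doc_id, using block metadata only."""
    b = bisect.bisect_left(lst.block_last, doc_id)
    return lst.block_max[b] if b < len(lst.block_max) else 0

def exact_score(lst, doc_id):
    """Exact score of the term for doc_id (0 if the document is not in the list)."""
    i = bisect.bisect_left(lst.doc_ids, doc_id)
    if i < len(lst.doc_ids) and lst.doc_ids[i] == doc_id:
        return lst.scores[i]
    return 0

def top_k(lists, k, use_block_max=True):
    """Disjunctive top-k; returns (results, number of documents fully scored)."""
    heap = []   # min-heap of (score, doc_id); heap[0] holds the current threshold
    scored = 0
    for doc_id in sorted({d for l in lists for d in l.doc_ids}):
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        # Safe skip: the sum of block maxima bounds the true score from above.
        if use_block_max and sum(block_bound(l, doc_id) for l in lists) <= threshold:
            continue
        scored += 1
        score = sum(exact_score(l, doc_id) for l in lists)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > threshold:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), scored

# Two toy inverted lists (made-up data for illustration).
list_a = [(1, 5), (3, 2), (7, 20), (9, 15), (12, 1), (15, 3)]
list_b = [(2, 4), (3, 3), (7, 10), (10, 2), (12, 2), (20, 1)]
lists = [build_block_max(list_a), build_block_max(list_b)]

safe, scored_fast = top_k(lists, k=2)
exhaustive, scored_all = top_k(lists, k=2, use_block_max=False)
print(safe == exhaustive, scored_fast, scored_all)  # prints: True 5 9
```

In this toy run the filtered evaluation returns the same top-2 result as the exhaustive one while fully scoring 5 of the 9 candidate documents, which is the sense in which the technique is "safe".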

    Original language: English (US)
    Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    Pages: 993-1002
    Number of pages: 10
    DOIs: https://doi.org/10.1145/2009916.2010048
    State: Published - 2011
    Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11 - Beijing, China
    Duration: Jul 24 2011 to Jul 28 2011

    Other

    Other: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11
    Country: China
    City: Beijing
    Period: 7/24/11 to 7/28/11

    Fingerprint

    Query processing
    Search engines

    Keywords

    • Early termination
    • Inverted index
    • IR query processing
    • Top-k query processing

    ASJC Scopus subject areas

    • Information Systems

    Cite this

    Ding, S., & Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 993-1002). https://doi.org/10.1145/2009916.2010048

    @inproceedings{62729124ba2343689084aedc21e928cd,
    title = "Faster top-k document retrieval using block-max indexes",
    keywords = "Early termination, Inverted index, IR query processing, Top-k query processing",
    author = "Shuai Ding and Torsten Suel",
    year = "2011",
    doi = "10.1145/2009916.2010048",
    language = "English (US)",
    isbn = "9781450309349",
    pages = "993--1002",
    booktitle = "SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval",

    }

    TY - GEN

    T1 - Faster top-k document retrieval using block-max indexes

    AU - Ding, Shuai

    AU - Suel, Torsten

    PY - 2011

    Y1 - 2011

    AB - Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.

    KW - Early termination

    KW - Inverted index

    KW - IR query processing

    KW - Top-k query processing

    UR - http://www.scopus.com/inward/record.url?scp=80052124546&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=80052124546&partnerID=8YFLogxK

    U2 - 10.1145/2009916.2010048

    DO - 10.1145/2009916.2010048

    M3 - Conference contribution

    SN - 9781450309349

    SP - 993

    EP - 1002

    BT - SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

    ER -