Improved methods for static index pruning

Wei Jiang, Juan Rodriguez, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Static index pruning is a performance optimization technique for search engines that attempts to identify and remove index postings that are unlikely to lead to top results for typical user queries. The goal is to obtain a much smaller inverted index that can quickly return results that are (almost) as good as those for the unpruned index. We make two contributions. First, we improve on previous results for pruned index size through a careful analysis of both document and query distribution characteristics. We derive an initial model based on unigram probabilities that obtains gains over previous work in some cases, and a bigram-based approach that achieves some additional improvements. We also devise a simple method for generating query logs in the absence of real-life queries, which is useful in modeling top results. Our second contribution is to explore, and compare against, previously proposed approaches that perform pruning based on how often documents or postings appeared in top positions in the past.
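To make the idea of posting-level pruning concrete, here is a minimal sketch in Python. It is not the model from the paper: the value function (estimated probability that a term is queried, multiplied by the posting's score), the function names `unigram_probabilities` and `prune_index`, and the integer `budget` parameter are all assumptions chosen for illustration.

```python
from collections import Counter


def unigram_probabilities(queries):
    """Estimate, for each term, the fraction of queries that contain it.

    queries: list of queries, each a list of term strings.
    """
    if not queries:
        return {}
    # Count each term at most once per query.
    counts = Counter(term for query in queries for term in set(query))
    return {term: count / len(queries) for term, count in counts.items()}


def prune_index(index, term_prob, budget):
    """Keep the `budget` postings with the highest illustrative value.

    index: dict mapping term -> list of (doc_id, score) postings.
    term_prob: dict mapping term -> estimated probability of being queried.
    The value of a posting is term_prob[term] * score; this heuristic is an
    assumption for the sketch, not the scoring used in the paper.
    """
    valued = [
        (term_prob.get(term, 0.0) * score, term, doc_id, score)
        for term, postings in index.items()
        for doc_id, score in postings
    ]
    # Sort by value only, highest first; ties keep their original order.
    valued.sort(key=lambda item: item[0], reverse=True)

    pruned = {}
    for _, term, doc_id, score in valued[:budget]:
        pruned.setdefault(term, []).append((doc_id, score))
    return pruned


# Hypothetical toy data: a four-query log and a five-posting index.
query_log = [["index", "pruning"], ["search", "index"], ["index"], ["search"]]
toy_index = {
    "index": [(1, 2.0), (2, 0.5)],
    "search": [(1, 1.0), (3, 3.0)],
    "rare": [(2, 5.0)],
}
small_index = prune_index(toy_index, unigram_probabilities(query_log), budget=3)
```

In this toy setup the posting for `rare` is dropped despite its high score, because the term never occurs in the query log. That is the intuition behind weighting postings by query statistics rather than by document statistics alone.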

    Original language: English (US)
    Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 686-695
    Number of pages: 10
    ISBN (Electronic): 9781467390040
    DOI: https://doi.org/10.1109/BigData.2016.7840661
    State: Published - Feb 2 2017
    Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
    Duration: Dec 5 2016 to Dec 8 2016

    Other

    Other: 4th IEEE International Conference on Big Data, Big Data 2016
    Country: United States
    City: Washington
    Period: 12/5/16 to 12/8/16

    Fingerprint

    Search engines

    Keywords

    • index
    • search
    • static pruning

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Information Systems
    • Hardware and Architecture

    Cite this

    Jiang, W., Rodriguez, J., & Suel, T. (2017). Improved methods for static index pruning. In Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 (pp. 686-695). [7840661] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2016.7840661

    @inproceedings{d1e84ed8335d4af5a5a6c0a4bad89e27,
    title = "Improved methods for static index pruning",
    keywords = "index, search, static pruning",
    author = "Wei Jiang and Juan Rodriguez and Torsten Suel",
    year = "2017",
    month = "2",
    day = "2",
    doi = "10.1109/BigData.2016.7840661",
    language = "English (US)",
    pages = "686--695",
    booktitle = "Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    address = "United States"
    }

    TY - GEN

    T1 - Improved methods for static index pruning

    AU - Jiang, Wei

    AU - Rodriguez, Juan

    AU - Suel, Torsten

    PY - 2017/2/2

    Y1 - 2017/2/2

    KW - index

    KW - search

    KW - static pruning

    UR - http://www.scopus.com/inward/record.url?scp=85015198617&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85015198617&partnerID=8YFLogxK

    U2 - 10.1109/BigData.2016.7840661

    DO - 10.1109/BigData.2016.7840661

    M3 - Conference contribution

    AN - SCOPUS:85015198617

    SP - 686

    EP - 695

    BT - Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

    PB - Institute of Electrical and Electronics Engineers Inc.

    ER -