Faster temporal range queries over versioned text

Jinru He, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index compression and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.
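
The "simple approach" mentioned in the abstract (partition the index into time ranges, then access only the relevant parts) can be illustrated with a minimal sketch. This is not the authors' method: it ignores index compression entirely, which the abstract identifies as the crucial difficulty, and it assumes each version carries a single integer timestamp. The class name `TimePartitionedIndex`, its methods, and the fixed `partition_width` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class TimePartitionedIndex:
    """Toy inverted index split into fixed-width time partitions (illustrative only)."""

    def __init__(self, partition_width):
        # Assumption: timestamps are non-negative integers and every
        # partition covers the same number of time units.
        self.partition_width = partition_width
        # partition id -> term -> list of (doc_id, version_time) postings
        self.partitions = defaultdict(lambda: defaultdict(list))

    def add_version(self, doc_id, version_time, text):
        """Index one version of a document under the partition of its timestamp."""
        pid = version_time // self.partition_width
        for term in set(text.lower().split()):
            self.partitions[pid][term].append((doc_id, version_time))

    def range_query(self, terms, start, end):
        """Return sorted (doc_id, version_time) pairs in [start, end] containing all terms."""
        terms = [t.lower() for t in terms]
        if not terms:
            return []
        results = []
        first = start // self.partition_width
        last = end // self.partition_width
        # Only partitions overlapping the query range are touched.
        for pid in range(first, last + 1):
            part = self.partitions.get(pid)
            if part is None:
                continue
            postings = [set(part.get(t, ())) for t in terms]
            matches = set.intersection(*postings)
            # Boundary partitions may hold versions outside the range.
            results.extend(m for m in matches if start <= m[1] <= end)
        return sorted(results)


index = TimePartitionedIndex(partition_width=100)
index.add_version("A", 10, "inverted index compression")
index.add_version("A", 150, "inverted index partitioning")
index.add_version("B", 160, "temporal index partitioning")
index.add_version("B", 320, "temporal range queries")

hits = index.range_query(["index", "partitioning"], 100, 199)
print(hits)
```

In this sketch the query over the range 100 to 199 reads only one partition rather than the whole index. A realistic design would also have to decide how partition boundaries interact with the compression of near-duplicate versions, which is the trade-off the paper studies.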

    Original language: English (US)
    Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    Pages: 565-574
    Number of pages: 10
    DOIs: https://doi.org/10.1145/2009916.2009993
    State: Published - 2011
    Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11 - Beijing, China
    Duration: Jul 24 2011 to Jul 28 2011

    Other

    Other: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11
    Country: China
    City: Beijing
    Period: 7/24/11 to 7/28/11

    Fingerprint

    Throughput
    Internet
    Data storage equipment
    Experiments

    Keywords

    • Inverted index
    • Query processing
    • Range queries
    • Temporal search
    • Versioned documents

    ASJC Scopus subject areas

    • Information Systems

    Cite this

    He, J., & Suel, T. (2011). Faster temporal range queries over versioned text. In SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 565-574) https://doi.org/10.1145/2009916.2009993

    Faster temporal range queries over versioned text. / He, Jinru; Suel, Torsten.

    SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2011. p. 565-574.

    He, J & Suel, T 2011, Faster temporal range queries over versioned text. in SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 565-574, 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11, Beijing, China, 7/24/11. https://doi.org/10.1145/2009916.2009993
    He J, Suel T. Faster temporal range queries over versioned text. In SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2011. p. 565-574 https://doi.org/10.1145/2009916.2009993
    He, Jinru ; Suel, Torsten. / Faster temporal range queries over versioned text. SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2011. pp. 565-574
    @inproceedings{cb6717cb179a4b108b7fffd70af40d3c,
    title = "Faster temporal range queries over versioned text",
    abstract = "Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index compression and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.",
    keywords = "Inverted index, Query processing, Range queries, Temporal search, Versioned documents",
    author = "Jinru He and Torsten Suel",
    year = "2011",
    doi = "10.1145/2009916.2009993",
    language = "English (US)",
    isbn = "9781450309349",
    pages = "565--574",
    booktitle = "SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval",

    }

    TY - GEN

    T1 - Faster temporal range queries over versioned text

    AU - He, Jinru

    AU - Suel, Torsten

    PY - 2011

    Y1 - 2011

    N2 - Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index compression and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.

    KW - Inverted index

    KW - Query processing

    KW - Range queries

    KW - Temporal search

    KW - Versioned documents

    UR - http://www.scopus.com/inward/record.url?scp=80052128215&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=80052128215&partnerID=8YFLogxK

    U2 - 10.1145/2009916.2009993

    DO - 10.1145/2009916.2009993

    M3 - Conference contribution

    SN - 9781450309349

    SP - 565

    EP - 574

    BT - SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

    ER -