Optimizing positional index structures for versioned document collections

Jinru He, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on the case of positional index structures, while most previous work has focused on the non-positional case. Building on earlier work in [zs:redun], we propose a framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing.

    Original language: English (US)
    Title of host publication: SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval
    Pages: 245-254
    Number of pages: 10
    DOIs: https://doi.org/10.1145/2348283.2348319
    State: Published - 2012
    Event: 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012 - Portland, OR, United States
    Duration: Aug 12 2012 → Aug 16 2012

    Other

    Other: 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012
    Country: United States
    City: Portland, OR
    Period: 8/12/12 → 8/16/12

    Fingerprint

    Query processing
    Redundancy
    Control systems
    Experiments

    Keywords

    • index compression
    • Inverted index
    • positional index structures
    • redundancy elimination
    • versioned documents

    ASJC Scopus subject areas

    • Information Systems

    Cite this

    He, J., & Suel, T. (2012). Optimizing positional index structures for versioned document collections. In SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 245-254) https://doi.org/10.1145/2348283.2348319

    @inproceedings{1bcd7c20a3214552840303c18f78f435,
    title = "Optimizing positional index structures for versioned document collections",
    keywords = "index compression, Inverted index, positional index structures, redundancy elimination, versioned documents",
    author = "Jinru He and Torsten Suel",
    year = "2012",
    doi = "10.1145/2348283.2348319",
    language = "English (US)",
    isbn = "9781450316583",
    pages = "245--254",
    booktitle = "SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval",

    }

    TY - GEN

    T1 - Optimizing positional index structures for versioned document collections

    AU - He, Jinru

    AU - Suel, Torsten

    PY - 2012

    Y1 - 2012

    KW - index compression

    KW - Inverted index

    KW - positional index structures

    KW - redundancy elimination

    KW - versioned documents

    UR - http://www.scopus.com/inward/record.url?scp=84866620381&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84866620381&partnerID=8YFLogxK

    U2 - 10.1145/2348283.2348319

    DO - 10.1145/2348283.2348319

    M3 - Conference contribution

    SN - 9781450316583

    SP - 245

    EP - 254

    BT - SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval

    ER -