Compact full-text indexing of versioned document collections

Jinru He, Hao Yan, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia and the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, so that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of the new techniques with existing techniques from the literature. Our results on an archive of the English version of Wikipedia and on a subset of the Internet Archive collection show significant benefits over previous approaches.

    Original language: English (US)
    Title of host publication: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
    Pages: 415-424
    Number of pages: 10
    DOI: https://doi.org/10.1145/1645953.1646008
    State: Published - 2009
    Event: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
    Duration: Nov 2 2009 → Nov 6 2009

    Other

    Other: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
    Country: China
    City: Hong Kong
    Period: 11/2/09 → 11/6/09

    Keywords

    • Inverted index
    • Inverted index compression
    • Search engines
    • Versioned documents
    • Web archives
    • Wikipedia

    ASJC Scopus subject areas

    • Business, Management and Accounting (all)
    • Decision Sciences (all)

    Cite this

    He, J., Yan, H., & Suel, T. (2009). Compact full-text indexing of versioned document collections. In ACM 18th International Conference on Information and Knowledge Management, CIKM 2009 (pp. 415-424). https://doi.org/10.1145/1645953.1646008

    @inproceedings{909646869c064ea1bbdb73770774081c,
    title = "Compact full-text indexing of versioned document collections",
    keywords = "Inverted index, Inverted index compression, Search engines, Versioned documents, Web archives, Wikipedia",
    author = "Jinru He and Hao Yan and Torsten Suel",
    year = "2009",
    doi = "10.1145/1645953.1646008",
    language = "English (US)",
    isbn = "9781605585123",
    pages = "415--424",
    booktitle = "ACM 18th International Conference on Information and Knowledge Management, CIKM 2009",
    }
