Efficient search in large textual collections with redundancy

Jiangong Zhang, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages. In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on search engine query log and a large collection consisting of multiple crawls.

    Original language: English (US)
    Title of host publication: 16th International World Wide Web Conference, WWW2007
    Pages: 411-420
    Number of pages: 10
    DOIs: https://doi.org/10.1145/1242572.1242628
    State: Published - 2007
    Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
    Duration: May 8 2007 → May 12 2007

    Other

    Other: 16th International World Wide Web Conference, WWW2007
    Country: Canada
    City: Banff, AB
    Period: 5/8/07 → 5/12/07

    Keywords

    • Index compression
    • Inverted index
    • Query execution
    • Redundancy elimination
    • Search engines

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Software

    Cite this

    Zhang, J., & Suel, T. (2007). Efficient search in large textual collections with redundancy. In 16th International World Wide Web Conference, WWW2007 (pp. 411-420) https://doi.org/10.1145/1242572.1242628

    @inproceedings{1fb54cc4f7af4f07b8aa55516f006583,
    title = "Efficient search in large textual collections with redundancy",
    abstract = "Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages. In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on search engine query log and a large collection consisting of multiple crawls.",
    keywords = "Index compression, Inverted index, Query execution, Redundancy elimination, Search engines",
    author = "Jiangong Zhang and Torsten Suel",
    year = "2007",
    doi = "10.1145/1242572.1242628",
    language = "English (US)",
    isbn = "1595936548",
    pages = "411--420",
    booktitle = "16th International World Wide Web Conference, WWW2007",

    }

    TY - GEN

    T1 - Efficient search in large textual collections with redundancy

    AU - Zhang, Jiangong

    AU - Suel, Torsten

    PY - 2007

    Y1 - 2007

    N2 - Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages. In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on search engine query log and a large collection consisting of multiple crawls.

    KW - Index compression

    KW - Inverted index

    KW - Query execution

    KW - Redundancy elimination

    KW - Search engines

    UR - http://www.scopus.com/inward/record.url?scp=35348858153&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=35348858153&partnerID=8YFLogxK

    U2 - 10.1145/1242572.1242628

    DO - 10.1145/1242572.1242628

    M3 - Conference contribution

    SN - 1595936548

    SN - 9781595936547

    SP - 411

    EP - 420

    BT - 16th International World Wide Web Conference, WWW2007

    ER -