Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval Systems

Michał Siedlaczek, Qi Wang, Yen Yu Chen, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Many content-based image search and instance retrieval systems implement bag-of-visual-words strategies for candidate selection. Visual processing of an image results in hundreds of visual words that make up a document, and these words are used to build an inverted index. Query processing then consists of an initial candidate selection phase that queries the inverted index, followed by more complex reranking of the candidates using various image features. The initial phase typically uses disjunctive top-k query processing algorithms originally proposed for searching text collections. Our objective in this paper is to optimize the performance of disjunctive top-k computation for candidate selection in content-based instance retrieval systems. While there has been extensive previous work on optimizing this phase for textual search engines, we are unaware of any published work that studies this problem for instance retrieval, where both index and query data are quite different from the distributions commonly found and exploited in the textual case. Using data from a commercial large-scale instance retrieval system, we address this challenge in three steps. First, we analyze the quantitative properties of index structures and queries in the system, and discuss how they differ from the case of text retrieval. Second, we describe an optimized term-at-a-time retrieval strategy that significantly outperforms baseline term-at-a-time and document-at-a-time strategies, achieving up to 66% speed-up over the most efficient baseline. Finally, we show that due to the different properties of the data, several common safe and unsafe early termination techniques from the literature fail to provide any significant performance benefits.
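To make the candidate selection phase described in the abstract concrete, the following is a minimal sketch of a plain (unoptimized) term-at-a-time disjunctive top-k query over an inverted index of visual words. It is not the optimized strategy from the paper. The function names `build_inverted_index` and `taat_top_k`, the integer visual-word IDs, and the use of raw term frequency as the score are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def build_inverted_index(docs):
    """Map each visual word ID to a list of (doc_id, term_frequency) postings.

    `docs` is a list of images, each given as a list of visual word IDs
    (a hypothetical representation chosen for this sketch).
    """
    index = defaultdict(list)
    for doc_id, words in enumerate(docs):
        counts = defaultdict(int)
        for word in words:
            counts[word] += 1
        for word, tf in counts.items():
            index[word].append((doc_id, tf))
    return index


def taat_top_k(index, query_words, k):
    """Baseline term-at-a-time disjunctive top-k.

    Posting lists are traversed one query term at a time, and partial
    scores are summed in per-document accumulators. Any document that
    contains at least one query word is a candidate (disjunctive).
    Raw term frequency stands in for a real scoring function.
    """
    accumulators = defaultdict(float)
    for word in set(query_words):
        for doc_id, tf in index.get(word, ()):
            accumulators[doc_id] += tf
    # Highest score first; ties are broken by the smaller document ID.
    return heapq.nlargest(
        k, accumulators.items(), key=lambda item: (item[1], -item[0])
    )
```

A call such as `taat_top_k(build_inverted_index(docs), query_words, k)` returns up to `k` `(doc_id, score)` pairs, which a later stage could rerank with richer image features.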

    Original language: English (US)
    Title of host publication: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
    Editors: Yang Song, Bing Liu, Kisung Lee, Naoki Abe, Calton Pu, Mu Qiao, Nesreen Ahmed, Donald Kossmann, Jeffrey Saltz, Jiliang Tang, Jingrui He, Huan Liu, Xiaohua Hu
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 821-830
    Number of pages: 10
    ISBN (Electronic): 9781538650356
    DOI: https://doi.org/10.1109/BigData.2018.8621935
    State: Published - Jan 22 2019
    Event: 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
    Duration: Dec 10 2018 - Dec 13 2018

    Publication series

    Name: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

    Conference

    Conference: 2018 IEEE International Conference on Big Data, Big Data 2018
    Country: United States
    City: Seattle
    Period: 12/10/18 - 12/13/18

    Fingerprint

    Content based retrieval
    Query processing
    Search engines
    Processing

    Keywords

    • bag-of-visual-words
    • candidate selection
    • cascade ranking
    • image retrieval
    • inverted index
    • top-k search

    ASJC Scopus subject areas

    • Computer Science Applications
    • Information Systems

    Cite this

    Siedlaczek, M., Wang, Q., Chen, Y. Y., & Suel, T. (2019). Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval Systems. In Y. Song, B. Liu, K. Lee, N. Abe, C. Pu, M. Qiao, N. Ahmed, D. Kossmann, J. Saltz, J. Tang, J. He, H. Liu, ... X. Hu (Eds.), Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 (pp. 821-830). [8621935] (Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2018.8621935

    @inproceedings{d8cb383cf0b5431e84f0c272cf618d5e,
    title = "Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval Systems",
    abstract = "Many content-based image search and instance retrieval systems implement bag-of-visual-words strategies for candidate selection. Visual processing of an image results in hundreds of visual words that make up a document, and these words are used to build an inverted index. Query processing then consists of an initial candidate selection phase that queries the inverted index, followed by more complex reranking of the candidates using various image features. The initial phase typically uses disjunctive top-k query processing algorithms originally proposed for searching text collections. Our objective in this paper is to optimize the performance of disjunctive top-k computation for candidate selection in content-based instance retrieval systems. While there has been extensive previous work on optimizing this phase for textual search engines, we are unaware of any published work that studies this problem for instance retrieval, where both index and query data are quite different from the distributions commonly found and exploited in the textual case. Using data from a commercial large-scale instance retrieval system, we address this challenge in three steps. First, we analyze the quantitative properties of index structures and queries in the system, and discuss how they differ from the case of text retrieval. Second, we describe an optimized term-at-a-time retrieval strategy that significantly outperforms baseline term-at-a-time and document-at-a-time strategies, achieving up to 66{\%} speed-up over the most efficient baseline. Finally, we show that due to the different properties of the data, several common safe and unsafe early termination techniques from the literature fail to provide any significant performance benefits.",
    keywords = "bag-of-visual-words, candidate selection, cascade ranking, image retrieval, inverted index, top-k search",
    author = "Michał Siedlaczek and Qi Wang and Chen, {Yen Yu} and Torsten Suel",
    year = "2019",
    month = "1",
    day = "22",
    doi = "10.1109/BigData.2018.8621935",
    language = "English (US)",
    series = "Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    pages = "821--830",
    editor = "Yang Song and Bing Liu and Kisung Lee and Naoki Abe and Calton Pu and Mu Qiao and Nesreen Ahmed and Donald Kossmann and Jeffrey Saltz and Jiliang Tang and Jingrui He and Huan Liu and Xiaohua Hu",
    booktitle = "Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018",

    }
