Selectivity estimation for Boolean queries

Zhiyuan Chen, Flip Korn, Nick Koudas, Shanmugavelayutham Muthukrishnan

    Research output: Contribution to conference › Paper

    Abstract

    In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substring predicates composed using Boolean operators. While there has been some work in estimating the selectivity of substring queries, the more general problem of estimating the selectivity of Boolean queries over substring predicates has not been studied. Our approach is to extract selectivity estimates from relationships between the substring predicates of the Boolean query. However, storing the correlation between all possible predicates in order to provide an exact answer to such predicates is clearly infeasible, as there is a super-exponential number of possible combinations of these predicates. Instead, our novel idea is to capture correlations in a space-efficient but approximate manner. We employ a Monte Carlo technique called set hashing to succinctly represent the set of strings containing a given substring as a signature vector of hash values. Correlations among substring predicates can then be generated on-the-fly by operating on these signatures. We formalize our approach and propose an algorithm for estimating the selectivity of any Boolean query using the signatures of its substring predicates. We then experimentally demonstrate the superiority of our approach over a straightforward approach based on the independence assumption wherein correlations are not explicitly captured.
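The set-hashing idea described in the abstract can be illustrated with a minimal sketch. It assumes a generic min-hash construction and is not the paper's exact algorithm, which the abstract does not spell out. The function names, the linear hash family `(a*x + b) mod p`, and the toy strings are all hypothetical. The sketch covers only a two-predicate conjunction, and it builds the string-id sets exactly for illustration, whereas a real summary structure would store only the sizes and signatures.

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1, an assumed modulus


def substring_ids(strings, sub):
    """Ids of the strings that contain `sub` (computed exactly, for illustration)."""
    return {i for i, s in enumerate(strings) if sub in s}


def make_hashes(k, seed=0):
    """k hypothetical hash functions h(x) = (a*x + b) mod P, stored as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]


def signature(ids, hashes):
    """Min-hash signature: the smallest hash value of the set under each function."""
    return [min(((a * x + b) % P for x in ids), default=P) for a, b in hashes]


def resemblance(sig_a, sig_b):
    """Fraction of agreeing components, an estimate of |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def and_selectivity(size_a, size_b, sig_a, sig_b, n):
    """Estimate the selectivity of (A AND B) from set sizes and signatures."""
    r = resemblance(sig_a, sig_b)
    # From r = |A ∩ B| / |A ∪ B| and |A ∪ B| = |A| + |B| - |A ∩ B|:
    intersection = r * (size_a + size_b) / (1 + r)
    return intersection / n


def independence_selectivity(size_a, size_b, n):
    """Baseline that multiplies the two selectivities, ignoring correlation."""
    return (size_a / n) * (size_b / n)


if __name__ == "__main__":
    strings = ["abc", "xabcx", "abcabc", "xyz", "yzx", "zzz", "qqq", "rst"]
    hashes = make_hashes(k=50)
    a = substring_ids(strings, "ab")
    b = substring_ids(strings, "bc")
    sig_a, sig_b = signature(a, hashes), signature(b, hashes)
    n = len(strings)
    print(and_selectivity(len(a), len(b), sig_a, sig_b, n))
    print(independence_selectivity(len(a), len(b), n))
```

In this toy data the substrings "ab" and "bc" occur in exactly the same three strings, so the signature-based estimate returns 0.375, the true selectivity of the conjunction, while the independence baseline returns 0.140625. With partially overlapping sets the signature estimate is only approximate, and its accuracy depends on the signature length `k`.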

    Original language: English (US)
    Pages: 216-225
    Number of pages: 10
    State: Published - Jan 1 2000
    Event: PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems - Dallas, TX, USA
    Duration: May 15 2000 to May 17 2000

    Conference

    Conference: PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
    City: Dallas, TX, USA
    Period: 5/15/00 to 5/17/00

    ASJC Scopus subject areas

    • Software
    • Information Systems
    • Hardware and Architecture

    Cite this

    Chen, Z., Korn, F., Koudas, N., & Muthukrishnan, S. (2000). Selectivity estimation for Boolean queries. 216-225. Paper presented at PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Dallas, TX, USA.

    @conference{a1822e6f5d244c7cbb3f324508895111,
    title = "Selectivity estimation for Boolean queries",
    author = "Zhiyuan Chen and Flip Korn and Nick Koudas and Shanmugavelayutham Muthukrishnan",
    year = "2000",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "216--225",
    note = "PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems ; Conference date: 15-05-2000 Through 17-05-2000",

    }

    TY - CONF

    T1 - Selectivity estimation for Boolean queries

    AU - Chen, Zhiyuan

    AU - Korn, Flip

    AU - Koudas, Nick

    AU - Muthukrishnan, Shanmugavelayutham

    PY - 2000/1/1

    Y1 - 2000/1/1

    UR - http://www.scopus.com/inward/record.url?scp=0033688075&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=0033688075&partnerID=8YFLogxK

    M3 - Paper

    AN - SCOPUS:0033688075

    SP - 216

    EP - 225

    ER -