Estimating statistical aggregates on probabilistic data streams

T. S. Jayram, Andrew McGregor, Shanmugavelayutham Muthukrishnan, Erik Vee

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    The probabilistic-stream model was introduced by Jayram et al. [20]. It is a generalization of the data stream model that is suited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical "deterministic" streams, where each item is deterministically one of the domain values. Designing efficient aggregation algorithms for probabilistic data is crucial for handling uncertainty in data-centric applications such as OLAP. Such algorithms are also useful in a variety of other settings, including analyzing search engine traffic and aggregation in sensor networks. We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one-pass streaming algorithms for estimating the expected mean of a probabilistic stream, improving upon results in [20]. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms for estimating F0, the number of distinct items, on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median of a probabilistic stream.
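
    The possible-worlds view described in the abstract can be illustrated with a minimal brute-force sketch. It is not one of the paper's streaming algorithms: it enumerates every deterministic stream explicitly, which is exactly the exponential blow-up that one-pass algorithms are meant to avoid. The representation (a dict of value-to-probability per item, with any missing probability mass meaning the item is absent), the independence of items, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy representation (an assumption, not the paper's notation):
# each stream item is a dict mapping a domain value to its probability.
# Any probability mass not accounted for means "the item does not appear".
ABSENT = None


def possible_worlds(stream):
    """Yield (deterministic_stream, probability) for every possible world,
    assuming the items are independent of each other."""
    per_item = []
    for dist in stream:
        outcomes = list(dist.items())
        missing = 1.0 - sum(dist.values())
        if missing > 1e-12:
            outcomes.append((ABSENT, missing))
        per_item.append(outcomes)
    for combo in product(*per_item):
        prob = 1.0
        world = []
        for value, p in combo:
            prob *= p
            if value is not ABSENT:
                world.append(value)
        yield world, prob


def f0(world):
    """Number of distinct items in a deterministic stream."""
    return len(set(world))


def f2(world):
    """Second frequency moment: sum of squared item frequencies."""
    return sum(c * c for c in Counter(world).values())


def expected(aggregate, stream):
    """Exact expectation of an aggregate, by enumerating all worlds."""
    return sum(p * aggregate(w) for w, p in possible_worlds(stream))


# Three probabilistic items: the second one is absent with probability 0.75.
stream = [{"a": 0.5, "b": 0.5}, {"a": 0.25}, {"b": 1.0}]
num_worlds = sum(1 for _ in possible_worlds(stream))
exp_f0 = expected(f0, stream)
exp_f2 = expected(f2, stream)
print(num_worlds, exp_f0, exp_f2)
```

    The number of worlds is the product of the per-item outcome counts (here 2 × 2 × 1), so it grows exponentially with the stream length; the sketch is useful only as a ground truth for checking an estimator on very small inputs.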

    Original language: English (US)
    Title of host publication: Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
    Pages: 243-252
    Number of pages: 10
    DOIs: https://doi.org/10.1145/1265530.1265565
    State: Published - Oct 29 2007
    Event: 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 - Beijing, China
    Duration: Jun 11 2007 to Jun 13 2007

    Publication series

    Name: Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

    Conference

    Conference: 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
    Country: China
    City: Beijing
    Period: 6/11/07 to 6/13/07

    Fingerprint

    Agglomeration
    Search engines
    Probability distributions
    Sensor networks
    Uncertainty

    Keywords

    • Frequency moments
    • Mean
    • Median
    • OLAP
    • Probabilistic streams

    ASJC Scopus subject areas

    • Software
    • Information Systems
    • Hardware and Architecture

    Cite this

    Jayram, T. S., McGregor, A., Muthukrishnan, S., & Vee, E. (2007). Estimating statistical aggregates on probabilistic data streams. In Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 (pp. 243-252). (Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems). https://doi.org/10.1145/1265530.1265565

    @inproceedings{5ad5a329e41745b2b011a1bb9fe1e7a8,
    title = "Estimating statistical aggregates on probabilistic data streams",
    keywords = "Frequency moments, Mean, Median, OLAP, Probabilistic streams",
    author = "Jayram, {T. S.} and Andrew McGregor and Shanmugavelayutham Muthukrishnan and Erik Vee",
    year = "2007",
    month = "10",
    day = "29",
    doi = "10.1145/1265530.1265565",
    language = "English (US)",
    isbn = "1595936858",
    series = "Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems",
    pages = "243--252",
    booktitle = "Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007",
    }

    TY - GEN

    T1 - Estimating statistical aggregates on probabilistic data streams

    AU - Jayram, T. S.

    AU - McGregor, Andrew

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Vee, Erik

    PY - 2007/10/29

    Y1 - 2007/10/29

    KW - Frequency moments

    KW - Mean

    KW - Median

    KW - OLAP

    KW - Probabilistic streams

    UR - http://www.scopus.com/inward/record.url?scp=35448934923&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=35448934923&partnerID=8YFLogxK

    U2 - 10.1145/1265530.1265565

    DO - 10.1145/1265530.1265565

    M3 - Conference contribution

    SN - 1595936858

    SN - 9781595936851

    T3 - Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

    SP - 243

    EP - 252

    BT - Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007

    ER -