Holistic aggregates in a networked world: Distributed tracking of approximate quantiles

Graham Cormode, Minos Garofalakis, Shanmugavelayutham Muthukrishnan, Rajeev Rastogi

    Research output: Contribution to journal (Conference article)

    Abstract

    While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting - our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., "heavy-hitters" queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
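This record does not give the paper's actual schemes, so the following is only an illustrative sketch of the general idea the abstract names: each remote site tracks its stream locally and contacts the coordinator only when a simple prediction of its behavior is violated. Everything here is an assumption made for illustration: the class names `RemoteSite` and `Coordinator`, the parameters `theta` and `num_points`, the "count is unchanged since the last message" prediction, and the equal-weight merge at the coordinator. Unlike the paper's solutions, this sketch stores every value at the site and carries no provable error guarantee.

```python
import bisect


class RemoteSite:
    """Hypothetical site: tracks its stream locally, reports only on drift."""

    def __init__(self, theta, num_points):
        self.theta = theta            # allowed relative growth before re-sending
        self.num_points = num_points  # number of quantile intervals per summary
        self.values = []              # kept sorted; NOT space efficient (sketch only)
        self.count_at_last_send = 0   # static prediction: "count is unchanged"
        self.messages_sent = 0

    def summary(self):
        """Return (local count, evenly spaced order statistics)."""
        n = len(self.values)
        idx = [min(n - 1, (i * n) // self.num_points)
               for i in range(self.num_points + 1)]
        return n, [self.values[j] for j in idx]

    def observe(self, x):
        """Insert x; return a fresh summary only if the prediction is violated."""
        bisect.insort(self.values, x)
        n = len(self.values)
        if n > (1 + self.theta) * self.count_at_last_send:
            self.count_at_last_send = n
            self.messages_sent += 1
            return self.summary()
        return None  # no communication needed


class Coordinator:
    """Hypothetical coordinator: answers from the last summary of each site."""

    def __init__(self):
        self.latest = {}

    def receive(self, site_id, summary):
        if summary is not None:
            self.latest[site_id] = summary

    def estimate_quantile(self, phi):
        # Treat each summary point as standing for an equal share of its site.
        weighted = []
        for n, points in self.latest.values():
            weight = n / len(points)
            weighted.extend((p, weight) for p in points)
        if not weighted:
            raise ValueError("no summaries received yet")
        weighted.sort()
        target = phi * sum(w for _, w in weighted)
        running = 0.0
        for p, w in weighted:
            running += w
            if running >= target:
                return p
        return weighted[-1][0]


def demo():
    """Two sites, 500 values each; returns (median estimate, total messages)."""
    sites = {"A": RemoteSite(0.1, 10), "B": RemoteSite(0.1, 10)}
    coord = Coordinator()
    for x in range(500):
        coord.receive("A", sites["A"].observe(x))
        coord.receive("B", sites["B"].observe(x + 500))
    total_messages = sum(s.messages_sent for s in sites.values())
    return coord.estimate_quantile(0.5), total_messages
```

Calling `demo()` returns the coordinator's median estimate together with the number of messages the two sites sent. Because a site re-sends only after its count grows by a factor of `1 + theta`, the message count grows roughly logarithmically in the stream length rather than linearly, which is the kind of saving over a naive send-everything scheme that the abstract refers to.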

    Original language: English (US)
    Pages (from-to): 25-36
    Number of pages: 12
    Journal: Proceedings of the ACM SIGMOD International Conference on Management of Data
    DOI: 10.1145/1066157.1066161
    State: Published - Dec 1 2005
    Event: SIGMOD 2005: ACM SIGMOD International Conference on Management of Data - Baltimore, MD, United States
    Duration: Jun 14 2005 - Jun 16 2005

    Fingerprint

    Communication
    Telecommunication networks
    Monitoring
    Costs
    Experiments

    ASJC Scopus subject areas

    • Software
    • Information Systems

    Cite this

    Cormode, Graham; Garofalakis, Minos; Muthukrishnan, Shanmugavelayutham; Rastogi, Rajeev. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 01.12.2005, p. 25-36.

    @article{5b1371b160424f13a7bbbc9b11b75cf9,
    title = "Holistic aggregates in a networked world: Distributed tracking of approximate quantiles",
    author = "Graham Cormode and Minos Garofalakis and Shanmugavelayutham Muthukrishnan and Rajeev Rastogi",
    year = "2005",
    month = "12",
    day = "1",
    doi = "10.1145/1066157.1066161",
    language = "English (US)",
    pages = "25--36",
    journal = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
    issn = "0730-8078",
    publisher = "Association for Computing Machinery (ACM)",

    }

    TY - JOUR

    T1 - Holistic aggregates in a networked world

    T2 - Distributed tracking of approximate quantiles

    AU - Cormode, Graham

    AU - Garofalakis, Minos

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Rastogi, Rajeev

    PY - 2005/12/1

    Y1 - 2005/12/1

    UR - http://www.scopus.com/inward/record.url?scp=29844457932&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=29844457932&partnerID=8YFLogxK

    U2 - 10.1145/1066157.1066161

    DO - 10.1145/1066157.1066161

    M3 - Conference article

    AN - SCOPUS:29844457932

    SP - 25

    EP - 36

    JO - Proceedings of the ACM SIGMOD International Conference on Management of Data

    JF - Proceedings of the ACM SIGMOD International Conference on Management of Data

    SN - 0730-8078

    ER -