Query-aware sampling for data streams

Theodore Johnson, Shanmugavelayutham Muthukrishnan, Vladislav Shkapenyuk, Oliver Spatscheck

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Data Stream Management Systems are useful when large volumes of data need to be processed in real time. Examples include monitoring network traffic, monitoring financial transactions, and analyzing large-scale scientific data feeds. These applications have varying data rates and often show bursts of high activity that overload the system, frequently at the instants most critical for analysis (e.g., network attacks, financial spikes). Therefore, load shedding is necessary to preserve the stability of the system, gracefully degrade its performance, and extract answers. Existing methods for load shedding in a general-purpose data stream query system use random sampling of tuples, essentially independent of the query. While this technique is acceptable for some queries, the results may be meaningless or even incorrect for other queries. In principle, a number of different query-dependent sampling methods exist, but they work only for particular queries. In this paper, we show how to perform query-aware sampling (semantic sampling), which works in general. We present methods for analyzing any given query, choosing sampling methods judiciously, and reconciling the sampling methods required by different queries in a query set. We conclude with experiments on a high-speed data stream that demonstrate, with different query sets, that our method produces accurate results while decreasing the load significantly.
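
The abstract's claim that query-independent tuple sampling can give meaningless answers for some queries can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic stream of (flow, sequence) pairs, the helper names `tuple_sample`, `group_sample`, and `flows_with_at_least`, and the use of hash-based sampling on the grouping key as one example of a query-dependent sampling method.

```python
import hashlib
import random
from collections import Counter


def tuple_sample(stream, rate, rng):
    """Query-independent shedding: keep each tuple with probability `rate`."""
    return [t for t in stream if rng.random() < rate]


def group_sample(stream, rate, key):
    """Keep all tuples of a group or none, decided by hashing the group key."""
    threshold = int(rate * 2**32)

    def keep(t):
        digest = hashlib.sha256(repr(key(t)).encode()).digest()
        return int.from_bytes(digest[:4], "big") < threshold

    return [t for t in stream if keep(t)]


def flows_with_at_least(stream, k):
    """Example query: how many flows have at least k packets?"""
    counts = Counter(flow for flow, _ in stream)
    return sum(1 for c in counts.values() if c >= k)


# Hypothetical stream: 1000 flows with 10 packets each, interleaved.
stream = [(flow, seq) for seq in range(10) for flow in range(1000)]
rate = 0.1

exact = flows_with_at_least(stream, 10)
by_tuple = flows_with_at_least(tuple_sample(stream, rate, random.Random(0)), 10)
by_group = flows_with_at_least(group_sample(stream, rate, key=lambda t: t[0]), 10)

# Scale the sampled counts by 1/rate to estimate the answer on the full stream.
print("exact:", exact)
print("tuple sampling estimate:", by_tuple / rate)
print("group sampling estimate:", by_group / rate)
```

In this sketch, tuple sampling almost never retains all ten packets of any flow, so its scaled estimate collapses toward zero, while sampling whole groups keeps the surviving flows intact and the scaled estimate stays close to the exact answer. Both samplers shed roughly the same fraction of the load.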

    Original language: English (US)
    Title of host publication: Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07
    Pages: 664-673
    Number of pages: 10
    DOIs: https://doi.org/10.1109/ICDEW.2007.4401053
    State: Published - Dec 1 2007
    Event: Workshops in Conjunction with the 23rd International Conference on Data Engineering - ICDE 2007 - Istanbul, Turkey
    Duration: Apr 15 2007 to Apr 20 2007

    Publication series

    Name: Proceedings - International Conference on Data Engineering
    ISSN (Print): 1084-4627

    Conference

    Conference: Workshops in Conjunction with the 23rd International Conference on Data Engineering - ICDE 2007
    Country: Turkey
    City: Istanbul
    Period: 4/15/07 to 4/20/07

    Fingerprint

    Sampling
    Monitoring
    Semantics
    Experiments

    ASJC Scopus subject areas

    • Software
    • Signal Processing
    • Information Systems

    Cite this

    Johnson, T., Muthukrishnan, S., Shkapenyuk, V., & Spatscheck, O. (2007). Query-aware sampling for data streams. In Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07 (pp. 664-673). [4401053] (Proceedings - International Conference on Data Engineering). https://doi.org/10.1109/ICDEW.2007.4401053

    @inproceedings{c46f287c704c4ad08d707bff26ddaeb5,
    title = "Query-aware sampling for data streams",
    author = "Theodore Johnson and Shanmugavelayutham Muthukrishnan and Vladislav Shkapenyuk and Oliver Spatscheck",
    year = "2007",
    month = "12",
    day = "1",
    doi = "10.1109/ICDEW.2007.4401053",
    language = "English (US)",
    isbn = "1424408326",
    series = "Proceedings - International Conference on Data Engineering",
    pages = "664--673",
    booktitle = "Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07",

    }

    TY - GEN

    T1 - Query-aware sampling for data streams

    AU - Johnson, Theodore

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Shkapenyuk, Vladislav

    AU - Spatscheck, Oliver

    PY - 2007/12/1

    Y1 - 2007/12/1

    UR - http://www.scopus.com/inward/record.url?scp=48349088657&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=48349088657&partnerID=8YFLogxK

    U2 - 10.1109/ICDEW.2007.4401053

    DO - 10.1109/ICDEW.2007.4401053

    M3 - Conference contribution

    AN - SCOPUS:48349088657

    SN - 1424408326

    SN - 9781424408320

    T3 - Proceedings - International Conference on Data Engineering

    SP - 664

    EP - 673

    BT - Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07

    ER -