How to scalably and accurately skip past streams

Supratik Bhattacharyya, André Madeira, Shanmugavelayutham Muthukrishnan, Tao Ye

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Data stream methods look at each new item of the stream, perform a small number of operations while keeping a small amount of memory, and still perform much-needed analyses. However, in many situations, the update speed per item is extremely critical and not every item can be extensively examined. In practice, this has been addressed by examining only every Nth item from the input, which reduces the input rate to a fraction 1/N of the original but loses the guarantees on the accuracy of the post-hoc analyses. In this paper, we present a technique of skipping past streams and looking at only a fraction of the input. Unlike traditional methods, our skipping is performed in a principled manner based on the "norm" of the stream seen. Using this technique on top of well-known sketches, we show several-fold improvement in the update time for processing streams with a given guaranteed accuracy, for a number of stream processing problems including data summarization, heavy-hitter detection, and self-join size estimation. We present experimental results of our methods over synthetic data and integrate our methods into Sprint's Continuous Monitoring (CMON) system for live network traffic analyses. Furthermore, aiming at future scalable stream processing systems and going beyond state-of-the-art packet header analyses, we show how the packet contents can be analyzed at streaming speeds, a more challenging task because each packet's content can result in many updates.
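
    The every-Nth baseline that the abstract contrasts against can be illustrated with a short sketch. This is not the paper's norm-based skipping technique, whose details the abstract does not give; it only shows why fixed-interval sampling can lose accuracy guarantees. Count-Min is used here as one example of a well-known sketch, and the class name, function names, width, depth, and seed are all assumptions made for illustration.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min sketch over integer keys (parameters are assumptions)."""

    PRIME = (1 << 61) - 1  # modulus for the simple linear hash functions

    def __init__(self, width=256, depth=4, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the example reproducible
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]

    def _index(self, row, key):
        a, b = self.coeffs[row]
        return ((a * key + b) % self.PRIME) % self.width

    def update(self, key, count=1):
        # One counter per row is incremented for every update.
        for r, row in enumerate(self.rows):
            row[self._index(r, key)] += count

    def estimate(self, key):
        # The minimum over rows never underestimates the inserted count.
        return min(row[self._index(r, key)] for r, row in enumerate(self.rows))


def every_nth(stream, n):
    """Yield the items at positions 0, n, 2n, ... of the stream."""
    for i, item in enumerate(stream):
        if i % n == 0:
            yield item


def demo_periodic_stream(n=4):
    """Sample a periodic stream every n items and sketch only the sampled items."""
    stream = [1, 2, 3, 4] * 100  # each key truly occurs 100 times
    sampled = list(every_nth(stream, n))
    sketch = CountMinSketch()
    for key in sampled:
        sketch.update(key)
    # Scaling by n is the usual correction for 1-in-n sampling.
    return sampled, sketch.estimate(1) * n
```

    On this periodic input, sampling every 4th item sees only key 1, so the scaled estimate for key 1 is four times its true count and keys 2, 3, and 4 are never observed. This is the kind of accuracy loss that a principled skipping rule is meant to avoid.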

    Original language: English (US)
    Title of host publication: Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07
    Pages: 654-663
    Number of pages: 10
    DOI: https://doi.org/10.1109/ICDEW.2007.4401052
    State: Published - Dec 1 2007
    Event: Workshops in Conjunction with the 23rd International Conference on Data Engineering - ICDE 2007 - Istanbul, Turkey
    Duration: Apr 15 2007 → Apr 20 2007

    Publication series

    Name: Proceedings - International Conference on Data Engineering
    ISSN (Print): 1084-4627

    Conference

    Conference: Workshops in Conjunction with the 23rd International Conference on Data Engineering - ICDE 2007
    Country: Turkey
    City: Istanbul
    Period: 4/15/07 → 4/20/07

    Fingerprint

    Processing
    Data storage equipment
    Monitoring

    ASJC Scopus subject areas

    • Software
    • Signal Processing
    • Information Systems

    Cite this

    Bhattacharyya, S., Madeira, A., Muthukrishnan, S., & Ye, T. (2007). How to scalably and accurately skip past streams. In Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07 (pp. 654-663). [4401052] (Proceedings - International Conference on Data Engineering). https://doi.org/10.1109/ICDEW.2007.4401052

    @inproceedings{f907d3e32ab34d87b9bc2f9738ff9752,
    title = "How to scalably and accurately skip past streams",
    author = "Supratik Bhattacharyya and Andr{\'e} Madeira and Shanmugavelayutham Muthukrishnan and Tao Ye",
    year = "2007",
    month = "12",
    day = "1",
    doi = "10.1109/ICDEW.2007.4401052",
    language = "English (US)",
    isbn = "1424408326",
    series = "Proceedings - International Conference on Data Engineering",
    pages = "654--663",
    booktitle = "Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07",

    }

    TY - GEN

    T1 - How to scalably and accurately skip past streams

    AU - Bhattacharyya, Supratik

    AU - Madeira, André

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Ye, Tao

    PY - 2007/12/1

    Y1 - 2007/12/1

    UR - http://www.scopus.com/inward/record.url?scp=48349093467&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=48349093467&partnerID=8YFLogxK

    U2 - 10.1109/ICDEW.2007.4401052

    DO - 10.1109/ICDEW.2007.4401052

    M3 - Conference contribution

    AN - SCOPUS:48349093467

    SN - 1424408326

    SN - 9781424408320

    T3 - Proceedings - International Conference on Data Engineering

    SP - 654

    EP - 663

    BT - Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07

    ER -