Modeling skew in data streams

Flip Korn, Shanmugavelayutham Muthukrishnan, Yihua Wu

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges to do model fitting at streaming speeds are both technical - how to continually find fast and reliable parameter estimates on high speed streams of skewed data using small space - and conceptual - how to validate the goodness-of-fit and stability of the model online.

    In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast, space-efficient, and provide approximations of parameter value estimates with a priori quality guarantees relative to those obtained offline. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot, to perform online model validation at streaming speeds.

    As a concrete application of our techniques, we focus on network traffic data which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T's Gigascope data stream management system, to demonstrate practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application, and demonstrate the potential of online parametric modeling.
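The two ideas named in the abstract, fitting a power law with a least-squares straight line and checking the fit with a q-q plot, can be illustrated offline with a minimal sketch. This is not the paper's streaming algorithm: it keeps all values in memory instead of maintaining a sketch, and the function names (`fit_pareto_shape`, `qq_linearity`) are hypothetical. It assumes Pareto data with scale 1, for which log P(X > x) is linear in log x with slope -alpha, and for which log X is exponentially distributed.

```python
import math
import random


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def fit_pareto_shape(values):
    """Estimate the Pareto shape alpha from the log-log empirical CCDF.

    Offline illustration only (assumption): the whole sample is sorted
    in memory, which a streaming algorithm would avoid.
    """
    ordered = sorted(values)
    n = len(ordered)
    # The empirical CCDF at the i-th smallest value is (n - i) / n.
    # The largest value is skipped because its CCDF is 0.
    log_x = [math.log(v) for v in ordered[:-1]]
    log_ccdf = [math.log((n - i) / n) for i in range(1, n)]
    slope, _ = least_squares_line(log_x, log_ccdf)
    return -slope


def qq_linearity(values):
    """Correlation of the q-q points against exponential quantiles.

    For Pareto data, log X is exponential, so plotting sorted log values
    against standard exponential quantiles gives a nearly straight line.
    A correlation close to 1 supports the model family; it does not by
    itself validate a particular alpha.
    """
    ordered = sorted(values)
    n = len(ordered)
    model_q = [-math.log(1 - (i + 0.5) / n) for i in range(n)]
    empirical_q = [math.log(v) for v in ordered]
    return pearson(model_q, empirical_q)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.paretovariate(1.5) for _ in range(20000)]
    print("estimated alpha:", fit_pareto_shape(sample))
    print("q-q linearity:", qq_linearity(sample))
```

On synthetic Pareto samples the estimated shape should land near the generating value and the q-q correlation near 1, while clearly non-Pareto data (for example, uniform values) should give a visibly lower correlation.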

    Original language: English (US)
    Title of host publication: SIGMOD 2006 - Proceedings of the ACM SIGMOD International Conference on Management of Data
    Pages: 181-192
    Number of pages: 12
    DOIs: https://doi.org/10.1145/1142473.1142495
    State: Published - Dec 1 2006
    Event: 2006 ACM SIGMOD International Conference on Management of Data - Chicago, IL, United States
    Duration: Jun 27 2006 - Jun 29 2006

    Publication series

    Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
    ISSN (Print): 0730-8078

    Other

    Other: 2006 ACM SIGMOD International Conference on Management of Data
    Country: United States
    City: Chicago, IL
    Period: 6/27/06 - 6/29/06

    Keywords

    • Estimation
    • Modeling
    • Skew
    • Streaming algorithms

    ASJC Scopus subject areas

    • Software
    • Information Systems

    Cite this

    Korn, F., Muthukrishnan, S., & Wu, Y. (2006). Modeling skew in data streams. In SIGMOD 2006 - Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 181-192). (Proceedings of the ACM SIGMOD International Conference on Management of Data). https://doi.org/10.1145/1142473.1142495

    @inproceedings{16d23bdee3484224b7eec8e740065e2e,
    title = "Modeling skew in data streams",
    abstract = "Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges to do model fitting at streaming speeds are both technical - how to continually find fast and reliable parameter estimates on high speed streams of skewed data using small space - and conceptual - how to validate the goodness-of-fit and stability of the model online. In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast, space-efficient, and provide approximations of parameter value estimates with a priori quality guarantees relative to those obtained offline. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot, to perform online model validation at streaming speeds. As a concrete application of our techniques, we focus on network traffic data which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T's Gigascope data stream management system, to demonstrate practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application, and demonstrate the potential of online parametric modeling.",
    keywords = "Estimation, Modeling, Skew, Streaming algorithms",
    author = "Flip Korn and Shanmugavelayutham Muthukrishnan and Yihua Wu",
    year = "2006",
    month = "12",
    day = "1",
    doi = "10.1145/1142473.1142495",
    language = "English (US)",
    isbn = "1595934340",
    series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
    pages = "181--192",
    booktitle = "SIGMOD 2006 - Proceedings of the ACM SIGMOD International Conference on Management of Data",

    }

    TY - GEN

    T1 - Modeling skew in data streams

    AU - Korn, Flip

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Wu, Yihua

    PY - 2006/12/1

    Y1 - 2006/12/1

    N2 - Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges to do model fitting at streaming speeds are both technical - how to continually find fast and reliable parameter estimates on high speed streams of skewed data using small space - and conceptual - how to validate the goodness-of-fit and stability of the model online. In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast, space-efficient, and provide approximations of parameter value estimates with a priori quality guarantees relative to those obtained offline. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot, to perform online model validation at streaming speeds. As a concrete application of our techniques, we focus on network traffic data which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T's Gigascope data stream management system, to demonstrate practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application, and demonstrate the potential of online parametric modeling.

    KW - Estimation

    KW - Modeling

    KW - Skew

    KW - Streaming algorithms

    UR - http://www.scopus.com/inward/record.url?scp=34250648954&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=34250648954&partnerID=8YFLogxK

    U2 - 10.1145/1142473.1142495

    DO - 10.1145/1142473.1142495

    M3 - Conference contribution

    SN - 1595934340

    SN - 9781595934345

    T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

    SP - 181

    EP - 192

    BT - SIGMOD 2006 - Proceedings of the ACM SIGMOD International Conference on Management of Data

    ER -