Mining deviants in time series data streams

Shanmugavelayutham Muthukrishnan, Rahul Shah, Jeffrey Scott Vitter

    Research output: Contribution to journal › Conference article

    Abstract

    One of the central tasks in managing, monitoring and mining data streams is identifying outliers. There is a long history of studying various kinds of outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of "deviants" from Jagadish et al. [19] as outliers. Deviants are based on one of the most fundamental statistical concepts, the standard deviation (or variance). Formally, deviants are defined in terms of a representation sparsity metric, i.e., deviants are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima/minima, but rather appropriate local aberrations. Deviants are known to be of great mining value in time series databases. We present the first known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time. For all versions of this problem - uni- vs multivariate time series, optimal vs near-optimal vs heuristic solutions, offline vs streaming - our algorithms share the same framework of maintaining a hierarchical set of candidate deviants that are updated as the time series data is progressively revealed. We show experimentally, using real network traffic data (SNMP aggregate time series) as well as synthetic data, that our algorithms are remarkably accurate in determining the deviants.
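
To make the definition concrete, here is a minimal offline sketch in Python. It is not the streaming algorithm of the paper. It assumes one simple compressed representation, chosen here only for illustration: the series is cut into fixed equal-width buckets, and each bucket is summarized by the mean of its retained values. The k deviants are then the k points whose removal minimizes the total squared error of that representation, found by brute force. The names `bucket_sse` and `find_deviants`, the bucket width, and the sample series are all hypothetical.

```python
from itertools import combinations


def bucket_sse(values, removed, bucket_size):
    """Total squared error when each fixed-width bucket is summarized
    by the mean of the values that were not removed."""
    total = 0.0
    for start in range(0, len(values), bucket_size):
        # Keep the values of this bucket whose global index is not removed.
        kept = [v for i, v in enumerate(values[start:start + bucket_size], start)
                if i not in removed]
        if kept:
            mean = sum(kept) / len(kept)
            total += sum((v - mean) ** 2 for v in kept)
    return total


def find_deviants(values, k, bucket_size):
    """Brute force: try every set of k indices and keep the one whose
    removal gives the smallest representation error (assumed metric)."""
    best = min(combinations(range(len(values)), k),
               key=lambda idx: bucket_sse(values, set(idx), bucket_size))
    return best, bucket_sse(values, set(best), bucket_size)


if __name__ == "__main__":
    # Two plateaus (around 1 and around 10), each with one local aberration.
    series = [1, 1, 1, 5, 1, 1, 10, 10, 10, 6, 10, 10]
    print(find_deviants(series, k=2, bucket_size=6))
```

In this sample the selected indices are 3 and 9 (values 5 and 6). Neither is the global maximum (10) nor the global minimum (1), which illustrates why deviants are local aberrations. The brute-force search is exponential in k and needs the whole series in memory, so it only serves to illustrate the definition on tiny inputs.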

    Original language: English (US)
    Pages (from-to): 41-50
    Number of pages: 10
    Journal: Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM
    Volume: 16
    State: Published - Oct 25 2004
    Event: Proceedings - 16th International Conference on Scientific and Statistical Database Management, SSDBM 2004 - Santorini Island, Greece
    Duration: Jun 21 2004 to Jun 23 2004

    Fingerprint

    Time Series Data
    Data Streams
    Outlier
    Time series
    Mining
    Multivariate Time Series
    Network Traffic
    Synthetic Data
    Aberrations
    Aberration
    Streaming
    Sparsity
    Instant
    Standard deviation
    Data mining
    Monitor
    Statistics
    Heuristics
    Monitoring
    Metric

    ASJC Scopus subject areas

    • Software
    • Applied Mathematics

    Cite this

    Mining deviants in time series data streams. / Muthukrishnan, Shanmugavelayutham; Shah, Rahul; Vitter, Jeffrey Scott.

    In: Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM, Vol. 16, 25.10.2004, p. 41-50.

    @article{ae1a8650a65c438785ee1024ec2e2fa5,
    title = "Mining deviants in time series data streams",
    author = "Shanmugavelayutham Muthukrishnan and Rahul Shah and Vitter, {Jeffrey Scott}",
    year = "2004",
    month = "10",
    day = "25",
    language = "English (US)",
    volume = "16",
    pages = "41--50",
    journal = "Scientific and Statistical Database Management - Proceedings of the International Working Conference",
    issn = "1099-3371",
    publisher = "IEEE Computer Society",

    }

    TY - JOUR

    T1 - Mining deviants in time series data streams

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Shah, Rahul

    AU - Vitter, Jeffrey Scott

    PY - 2004/10/25

    Y1 - 2004/10/25

    UR - http://www.scopus.com/inward/record.url?scp=5444269989&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=5444269989&partnerID=8YFLogxK

    M3 - Conference article

    AN - SCOPUS:5444269989

    VL - 16

    SP - 41

    EP - 50

    JO - Scientific and Statistical Database Management - Proceedings of the International Working Conference

    JF - Scientific and Statistical Database Management - Proceedings of the International Working Conference

    SN - 1099-3371

    ER -