Summarizing and mining inverse distributions on data streams via dynamic inverse sampling

Graham Cormode, Shanmugavelayutham Muthukrishnan, Irina Rozenbaum

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Emerging data stream management systems approach the challenge of massive data distributions which arrive at high speeds while there is only small storage by summarizing and mining the distributions using samples or sketches. However, data distributions can be "viewed" in different ways. A data stream of integer values can be viewed either as the forward distribution f(x), i.e., the number of occurrences of x in the stream, or as its inverse, f^{-1}(i), which is the number of items that appear i times. While both such "views" are equivalent in stored data systems, over data streams that entail approximations, they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet, many applications such as IP traffic monitoring naturally rely on mining inverse distributions. We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution vs. the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream with provable guarantees, which can be used for a variety of summarization tasks (building quantiles or equidepth histograms) and mining tasks (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on the quality of approximations and the time/space used by our streaming methods. We also complement our analytical and algorithmic results by presenting an experimental study of the methods over network data streams.
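To make the two "views" in the abstract concrete, the following minimal sketch computes both distributions exactly for a tiny in-memory stream. It is only an illustration of the definitions: the function names and the example stream are assumptions made here, and the code stores the whole stream, so it is not the paper's dynamic inverse sampling technique, which works in small space.

```python
from collections import Counter

def forward_distribution(stream):
    """f(x): number of occurrences of each item x in the stream."""
    return Counter(stream)

def inverse_distribution(stream):
    """f^{-1}(i): number of distinct items that occur exactly i times."""
    return Counter(forward_distribution(stream).values())

# Hypothetical example stream of integer values.
stream = [3, 7, 3, 9, 7, 3, 5]

f = forward_distribution(stream)      # item -> count
f_inv = inverse_distribution(stream)  # count -> number of distinct items

n = len(stream)  # total stream length
d = len(f)       # number of distinct items

# Fraction of distinct items that occur exactly i times
# (what a uniform sample over distinct items would estimate).
inv_fraction = {i: c / d for i, c in f_inv.items()}

# Probability that a uniformly chosen stream element belongs to an
# item occurring exactly i times (what a uniform sample over stream
# elements would estimate).
occ_fraction = {i: i * c / n for i, c in f_inv.items()}

print(dict(f))
print(dict(f_inv))
print(inv_fraction)
print(occ_fraction)
```

In this example half of the distinct items occur exactly once, yet a uniformly chosen stream element belongs to such an item only two times out of seven. This gap is one informal way to see why a sample drawn over stream elements can misrepresent the inverse distribution.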

    Original language: English (US)
    Title of host publication: VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases
    Pages: 25-36
    Number of pages: 12
    State: Published - Dec 1 2005
    Event: VLDB 2005 - 31st International Conference on Very Large Data Bases - Trondheim, Norway
    Duration: Aug 30 2005 to Sep 2 2005

    Publication series

    Name: VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases
    Volume: 1

    Other

    Other: VLDB 2005 - 31st International Conference on Very Large Data Bases
    Country: Norway
    City: Trondheim
    Period: 8/30/05 to 9/2/05

    Fingerprint

    Sampling
    Monitoring

    ASJC Scopus subject areas

    • Engineering(all)

    Cite this

    Cormode, G., Muthukrishnan, S., & Rozenbaum, I. (2005). Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases (pp. 25-36). (VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases; Vol. 1).

    @inproceedings{a09803ba65a942d9812480a8ab53b005,
    title = "Summarizing and mining inverse distributions on data streams via dynamic inverse sampling",
    author = "Graham Cormode and Shanmugavelayutham Muthukrishnan and Irina Rozenbaum",
    year = "2005",
    month = "12",
    day = "1",
    language = "English (US)",
    isbn = "1595931546",
    series = "VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases",
    pages = "25--36",
    booktitle = "VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases",

    }

    TY - GEN

    T1 - Summarizing and mining inverse distributions on data streams via dynamic inverse sampling

    AU - Cormode, Graham

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Rozenbaum, Irina

    PY - 2005/12/1

    Y1 - 2005/12/1

    UR - http://www.scopus.com/inward/record.url?scp=33745615174&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=33745615174&partnerID=8YFLogxK

    M3 - Conference contribution

    SN - 1595931546

    SN - 9781595931542

    T3 - VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases

    SP - 25

    EP - 36

    BT - VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases

    ER -