What's hot and what's not: Tracking most frequent items dynamically

Graham Cormode, Shanmugavelayutham Muthukrishnan

    Research output: Contribution to journal › Review article

    Abstract

    Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from "group testing." They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
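To make the group-testing idea concrete, here is a minimal Python sketch of the simplest special case only: tracking a single majority item (one that makes up more than half of the current relation) under both inserts and deletes. It is an illustration under stated assumptions, not the authors' full method. The class name `MajorityTracker`, its methods, and the restriction to non-negative integer item identifiers of a fixed bit width are all hypothetical choices made for this example. Each bit position defines one "group" (all items with that bit set), and a group holding more than half of the total count must contain the majority item.

```python
class MajorityTracker:
    """Sketch: find a majority item under inserts and deletes.

    Assumptions (not from the source): items are non-negative integers
    below 2**bits, and an item is deleted only while it is present.
    """

    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0                # net number of items currently present
        self.bit_counts = [0] * bits  # bit_counts[j]: net items with bit j set

    def _update(self, item, delta):
        if not 0 <= item < (1 << self.bits):
            raise ValueError("item out of range for configured bit width")
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, item):
        self._update(item, +1)

    def delete(self, item):
        # A deletion simply reverses the counter updates of an insertion.
        self._update(item, -1)

    def candidate(self):
        """Return the majority item if one exists.

        If no item exceeds half of the total, the returned value is
        arbitrary and must be checked by other means.
        """
        if self.total <= 0:
            return None
        result = 0
        for j in range(self.bits):
            # The group of items with bit j set outweighs the rest
            # exactly when the majority item has bit j set.
            if 2 * self.bit_counts[j] > self.total:
                result |= 1 << j
        return result


tracker = MajorityTracker(bits=8)
for x in [5, 5, 9, 5, 7]:
    tracker.insert(x)
tracker.delete(9)
tracker.delete(7)
print(tracker.candidate())
```

The structure uses one counter per bit plus a total, so its size does not depend on the number of rows, and a query never rescans the data. Handling several hot items above a threshold, with the probabilistic guarantee mentioned in the abstract, needs further machinery that this sketch does not attempt.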

    Original language: English (US)
    Pages (from-to): 249-278
    Number of pages: 30
    Journal: ACM Transactions on Database Systems
    Volume: 30
    Issue number: 1
    DOI: 10.1145/1061318.1061325
    State: Published - Mar 1 2005

    Fingerprint

    Statistics
    Data mining
    Data structures
    Testing
    Experiments

    Keywords

    • Approximate query answering
    • Data stream processing

    ASJC Scopus subject areas

    • Information Systems

    Cite this

    What's hot and what's not: Tracking most frequent items dynamically. / Cormode, Graham; Muthukrishnan, Shanmugavelayutham.

    In: ACM Transactions on Database Systems, Vol. 30, No. 1, 01.03.2005, p. 249-278.

    @article{f0abd36b11684d18839115810b425bd3,
    title = "What's hot and what's not: Tracking most frequent items dynamically",
    keywords = "Approximate query answering, Data stream processing",
    author = "Graham Cormode and Shanmugavelayutham Muthukrishnan",
    year = "2005",
    month = "3",
    day = "1",
    doi = "10.1145/1061318.1061325",
    language = "English (US)",
    volume = "30",
    pages = "249--278",
    journal = "ACM Transactions on Database Systems",
    issn = "0362-5915",
    publisher = "Association for Computing Machinery (ACM)",
    number = "1",

    }

    TY - JOUR

    T1 - What's hot and what's not

    T2 - Tracking most frequent items dynamically

    AU - Cormode, Graham

    AU - Muthukrishnan, Shanmugavelayutham

    PY - 2005/3/1

    Y1 - 2005/3/1

    KW - Approximate query answering

    KW - Data stream processing

    UR - http://www.scopus.com/inward/record.url?scp=23944436942&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=23944436942&partnerID=8YFLogxK

    U2 - 10.1145/1061318.1061325

    DO - 10.1145/1061318.1061325

    M3 - Review article

    AN - SCOPUS:23944436942

    VL - 30

    SP - 249

    EP - 278

    JO - ACM Transactions on Database Systems

    JF - ACM Transactions on Database Systems

    SN - 0362-5915

    IS - 1

    ER -