Finding hierarchical heavy hitters in streaming data

Graham Cormode, Flip Korn, Shanmugavelayutham Muthukrishnan, Divesh Srivastava

    Research output: Contribution to journal (Article)

    Abstract

    Data items that arrive online as streams typically have attributes which take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for summarization, visualization, and analysis. We develop an aggregate view based on certain organized sets of large-valued regions (heavy hitters) corresponding to hierarchically discounted frequency counts. We formally define the notion of hierarchical heavy hitters (HHHs). We first consider computing (approximate) HHHs over a data stream drawn from a single hierarchical attribute. We formalize the problem and give deterministic algorithms to find them in a single pass over the input. In order to analyze a wider range of realistic data streams (e.g., from IP traffic-monitoring applications), we generalize this problem to multiple dimensions. Here, the semantics of HHHs are more complex, since a child node can have multiple parent nodes. We present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. The product of hierarchical dimensions forms a mathematical lattice structure. Our algorithms exploit this structure, and so are able to track approximate HHHs using only a small, fixed number of statistics per stored item, regardless of the number of dimensions. We show experimentally, using real data, that our proposed algorithms yield outputs which are very similar (virtually identical, in many cases) to offline computations of the exact solutions, whereas straightforward heavy-hitters-based approaches give significantly inferior answer quality. Furthermore, the proposed algorithms result in an order-of-magnitude savings in data structure size while performing competitively.
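
    To make "hierarchically discounted frequency counts" concrete, the following is a minimal sketch of an exact, offline computation over a single hierarchy. It is not the streaming algorithm of the article; it assumes one common reading of the definition, in which a prefix is reported when its count, after excluding items already covered by a reported descendant, reaches a fraction phi of the input. The function name `exact_hhh`, the tuple encoding of hierarchy paths, and the sample data are all hypothetical.

```python
from collections import Counter


def exact_hhh(items, phi):
    """Offline, exact HHHs for one hierarchy (illustrative sketch only).

    items: list of equal-length tuples, each a root-to-leaf path.
    phi:   fraction of the input a prefix must reach to be reported.
    Returns a dict mapping each reported prefix to its discounted count.
    """
    if not items:
        return {}
    threshold = phi * len(items)
    depth = len(items[0])
    # Counts at the current level that no reported descendant has claimed.
    residual = Counter(items)
    hhh = {}
    for level in range(depth, -1, -1):
        rolled_up = Counter()
        for prefix, count in residual.items():
            if count >= threshold:
                # Heavy after discounting: report it and stop propagating.
                hhh[prefix] = count
            elif level > 0:
                # Not heavy: pass the unclaimed count up to the parent.
                rolled_up[prefix[:-1]] += count
        residual = rolled_up
    return hhh


# Hypothetical IPv4-style input, one tuple of octets per observed item.
stream = (
    [("10", "0", "0", "1")] * 5
    + [("10", "0", "0", "2")] * 1
    + [("10", "0", "1", "7")] * 2
    + [("192", "168", "1", "1")] * 2
)
result = exact_hhh(stream, phi=0.25)
```

    On this input the sketch reports the leaf ("10", "0", "0", "1") with count 5 and the prefix ("10", "0") with discounted count 3. A plain per-prefix heavy-hitter count would instead report every ancestor of the heavy leaf with count 8, which illustrates the redundancy that discounting is meant to remove.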

    Original language: English (US)
    Article number: 16
    Journal: ACM Transactions on Knowledge Discovery from Data
    Volume: 1
    Issue number: 4
    DOI: 10.1145/1324172.1324174
    State: Published - Jan 1 2008

    Keywords

    • Approximation algorithms
    • Data mining
    • Network data analysis

    ASJC Scopus subject areas

    • Computer Science (all)

    Cite this

    Finding hierarchical heavy hitters in streaming data. / Cormode, Graham; Korn, Flip; Muthukrishnan, Shanmugavelayutham; Srivastava, Divesh.

    In: ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 4, 16, 01.01.2008.

    @article{cc5ff6bc656447958ecd1c39a6f0003e,
    title = "Finding hierarchical heavy hitters in streaming data",
    keywords = "Approximation algorithms, Data mining, Network data analysis",
    author = "Graham Cormode and Flip Korn and Shanmugavelayutham Muthukrishnan and Divesh Srivastava",
    year = "2008",
    month = "1",
    day = "1",
    doi = "10.1145/1324172.1324174",
    language = "English (US)",
    volume = "1",
    journal = "ACM Transactions on Knowledge Discovery from Data",
    issn = "1556-4681",
    publisher = "Association for Computing Machinery (ACM)",
    number = "4",

    }

    TY - JOUR

    T1 - Finding hierarchical heavy hitters in streaming data

    AU - Cormode, Graham

    AU - Korn, Flip

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Srivastava, Divesh

    PY - 2008/1/1

    Y1 - 2008/1/1

    KW - Approximation algorithms

    KW - Data mining

    KW - Network data analysis

    UR - http://www.scopus.com/inward/record.url?scp=39149089260&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=39149089260&partnerID=8YFLogxK

    U2 - 10.1145/1324172.1324174

    DO - 10.1145/1324172.1324174

    M3 - Article

    VL - 1

    JO - ACM Transactions on Knowledge Discovery from Data

    JF - ACM Transactions on Knowledge Discovery from Data

    SN - 1556-4681

    IS - 4

    M1 - 16

    ER -