Histogramming data streams with fast per-item processing

Sudipto Guha, Piotr Indyk, Shanmugavelayutham Muthukrishnan, Martin J. Strauss

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    A vector $A$ of length $N$ can be approximately represented by a histogram $H$ by writing $[0,N)$ as the non-overlapping union of $B$ intervals $I_j$, assigning a value $b_j$ to $I_j$, and approximating $A_i$ by $H_i = b_j$ for $i \in I_j$. An optimal histogram representation $H_{opt}$ consists of the choices of $I_j$ and $b_j$ that minimize the sum-square-error $\|A - H\|_2^2 = \sum_i |A_i - H_i|^2$. Numerous applications in statistics, signal processing and databases rely on histograms; typically $B$ is (significantly) smaller than $N$ and, hence, representing $A$ by $H$ yields substantial compression. We give a deterministic algorithm that approximates $H_{opt}$ and outputs a histogram $H$ such that $\|A - H\|_2^2 \le (1 + \varepsilon)\,\|A - H_{opt}\|_2^2$. Our algorithm considers the data items $A_0, A_1, \ldots$ in order, i.e., in one pass, spends processing time $O(1)$ per item, uses total space $B \cdot \mathrm{poly}(\log N, \log \|A\|, 1/\varepsilon)$, and determines the histogram in time $\mathrm{poly}(B, \log N, \log \|A\|, 1/\varepsilon)$. Our algorithm is suitable for emerging applications where the signal is presented in a stream, the size of the signal is very large, and one must construct the histogram using significantly less space than the signal size. In particular, our algorithm is suited to high-performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., $\Omega(N)$, or worked longer, i.e., $N \log^{\Omega(1)}(N)$ total time over the $N$ data items. Our algorithm is the first that simultaneously uses small space and runs fast, taking $O(1)$ worst-case time for per-item processing. In addition, our algorithm is quite simple.
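To make the definitions in the abstract concrete, the following is a minimal Python sketch of the objective and of an exact offline solver. It is not the authors' one-pass algorithm: it is the classical dynamic program, which reads the whole vector and uses O(N·B) space, and is shown only to illustrate what H_opt and the sum-square-error mean. The function names `sse_of_histogram` and `optimal_histogram` are hypothetical, and the sketch assumes 1 ≤ B ≤ N with every bucket non-empty.

```python
from itertools import accumulate


def sse_of_histogram(a, boundaries, values):
    """Sum-square error of the histogram whose bucket j is
    [boundaries[j], boundaries[j + 1]) with constant value values[j]."""
    total = 0.0
    for j, b in enumerate(values):
        for i in range(boundaries[j], boundaries[j + 1]):
            total += (a[i] - b) ** 2
    return total


def optimal_histogram(a, num_buckets):
    """Exact offline dynamic program (illustration only, not streaming).

    Runs in O(N^2 * B) time and O(N * B) space; assumes 1 <= num_buckets <= len(a).
    Returns (boundaries, values, error).
    """
    n = len(a)
    # Prefix sums of a and a^2 give the best single-bucket error in O(1).
    s = [0.0] + list(accumulate(float(x) for x in a))
    q = [0.0] + list(accumulate(float(x) ** 2 for x in a))

    def bucket_cost(lo, hi):
        # min over b of sum_{i in [lo, hi)} (a[i] - b)^2, attained at the mean.
        tot = s[hi] - s[lo]
        return (q[hi] - q[lo]) - tot * tot / (hi - lo)

    inf = float("inf")
    # err[k][i]: least error when a[0:i] is covered by exactly k buckets.
    err = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    err[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for i in range(k, n + 1):
            for m in range(k - 1, i):
                c = err[k - 1][m] + bucket_cost(m, i)
                if c < err[k][i]:
                    err[k][i], cut[k][i] = c, m

    # Walk the stored cut points backwards to recover the bucket boundaries.
    boundaries, i = [n], n
    for k in range(num_buckets, 0, -1):
        i = cut[k][i]
        boundaries.append(i)
    boundaries.reverse()
    values = [(s[hi] - s[lo]) / (hi - lo)
              for lo, hi in zip(boundaries, boundaries[1:])]
    return boundaries, values, err[num_buckets][n]
```

For example, `optimal_histogram([1, 1, 1, 5, 5, 9, 9, 9], 3)` returns boundaries `[0, 3, 5, 8]`, values `[1.0, 5.0, 9.0]`, and error `0.0`. The quadratic dependence on N in this baseline is the kind of cost the abstract contrasts with O(1) per-item processing.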

    Original language: English (US)
    Title of host publication: Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings
    Pages: 681-692
    Number of pages: 12
    State: Published - Dec 1 2002
    Event: 29th International Colloquium on Automata, Languages, and Programming, ICALP 2002 - Malaga, Spain
    Duration: Jul 8 2002 to Jul 13 2002

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 2380 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Other

    Other: 29th International Colloquium on Automata, Languages, and Programming, ICALP 2002
    Country: Spain
    City: Malaga
    Period: 7/8/02 to 7/13/02

    Keywords

    • Histograms
    • Streaming algorithms

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Computer Science (all)

    Cite this

    Guha, S., Indyk, P., Muthukrishnan, S., & Strauss, M. J. (2002). Histogramming data streams with fast per-item processing. In Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings (pp. 681-692). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2380 LNCS).

    @inproceedings{c330e5f1e9004ae381279590d3446772,
    title = "Histogramming data streams with fast per-item processing",
    keywords = "Histograms, Streaming algorithms",
    author = "Sudipto Guha and Piotr Indyk and Shanmugavelayutham Muthukrishnan and Strauss, {Martin J.}",
    year = "2002",
    month = "12",
    day = "1",
    language = "English (US)",
    isbn = "3540438645",
    series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    pages = "681--692",
    booktitle = "Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings",

    }

    TY - GEN

    T1 - Histogramming data streams with fast per-item processing

    AU - Guha, Sudipto

    AU - Indyk, Piotr

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Strauss, Martin J.

    PY - 2002/12/1

    Y1 - 2002/12/1

    KW - Histograms

    KW - Streaming algorithms

    UR - http://www.scopus.com/inward/record.url?scp=84869198292&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84869198292&partnerID=8YFLogxK

    M3 - Conference contribution

    AN - SCOPUS:84869198292

    SN - 3540438645

    SN - 9783540438649

    T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

    SP - 681

    EP - 692

    BT - Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings

    ER -