### Abstract

A vector A of length N can be approximately represented by a histogram H, by writing [0,N) as the non-overlapping union of B intervals I_{j}, assigning a value b_{j} to I_{j}, and approximating A_{i} by H_{i} = b_{j} for i ∈ I_{j}. An optimal histogram representation H_{opt} consists of the choices of I_{j} and b_{j} that minimize the sum-square-error ∥A - H∥_{2}^{2} = ∑_{i} |A_{i} - H_{i}|^{2}. Numerous applications in statistics, signal processing and databases rely on histograms; typically B is (significantly) smaller than N and, hence, representing A by H yields substantial compression. We give a deterministic algorithm that approximates H_{opt} and outputs a histogram H such that ∥A - H∥_{2}^{2} ≤ (1 + ε) ∥A - H_{opt}∥_{2}^{2}. Our algorithm considers the data items A_{0}, A_{1}, … in order, i.e., in one pass, spends processing time O(1) per item, uses total space B poly(log(N), log ∥A∥, 1/ε), and determines the histogram in time poly(B, log(N), log ∥A∥, 1/ε). Our algorithm is suitable for emerging applications where the signal is presented in a stream, the size of the signal is very large, and one must construct the histogram using significantly smaller space than the signal size. In particular, our algorithm is suited to high-performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., Ω(N), or worked longer, i.e., N log^{Ω(1)}(N) total time over the N data items. Our algorithm is the first that simultaneously uses small space and runs fast, taking O(1) worst-case time for per-item processing. In addition, our algorithm is quite simple.
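To make the definition of H_{opt} concrete, the following is a minimal Python sketch of the classical offline dynamic program that computes an exact optimal B-interval histogram. It is not the one-pass streaming algorithm of the paper: it stores the whole vector and takes O(N²B) time, which is exactly the kind of cost the paper aims to avoid. The function names (`prefix_sums`, `interval_sse`, `optimal_histogram`) are illustrative assumptions, not names from the paper.

```python
from typing import List, Tuple


def prefix_sums(a: List[float]) -> Tuple[List[float], List[float]]:
    """Prefix sums of the values and of their squares."""
    s, q = [0.0], [0.0]
    for x in a:
        s.append(s[-1] + x)
        q.append(q[-1] + x * x)
    return s, q


def interval_sse(s: List[float], q: List[float], lo: int, hi: int) -> float:
    """Sum-square-error of approximating a[lo:hi] by its mean.

    For a fixed interval, the mean is the value b_j that minimizes the error.
    """
    total = s[hi] - s[lo]
    return max(0.0, (q[hi] - q[lo]) - total * total / (hi - lo))


def optimal_histogram(a: List[float], B: int):
    """Return (error, buckets) for an optimal histogram with at most B intervals.

    Each bucket is (start, end, value) covering indices start <= i < end.
    Offline O(N^2 * B) dynamic program, shown only to illustrate the objective.
    """
    n = len(a)
    B = min(B, n)  # more intervals than items cannot help
    s, q = prefix_sums(a)
    inf = float("inf")
    # err[k][i]: least error covering a[0:i] with exactly k intervals
    err = [[inf] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    err[0][0] = 0.0
    for k in range(1, B + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last interval is [j, i)
                c = err[k - 1][j] + interval_sse(s, q, j, i)
                if c < err[k][i]:
                    err[k][i], cut[k][i] = c, j
    # Walk the stored cut points backwards to recover the intervals.
    buckets, i = [], n
    for k in range(B, 0, -1):
        j = cut[k][i]
        buckets.append((j, i, (s[i] - s[j]) / (i - j)))
        i = j
    buckets.reverse()
    return err[B][n], buckets
```

For example, `optimal_histogram([1, 1, 1, 5, 5, 5, 9, 9], 3)` splits the vector at indices 3 and 6 and represents it with zero error, while with B = 1 the single bucket value is the overall mean.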

Original language | English (US) |
---|---|
Title of host publication | Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings |
Pages | 681-692 |
Number of pages | 12 |
State | Published - Dec 1 2002 |
Event | 29th International Colloquium on Automata, Languages, and Programming, ICALP 2002 - Malaga, Spain; Duration: Jul 8 2002 → Jul 13 2002 |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 2380 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |

### Other

Other | 29th International Colloquium on Automata, Languages, and Programming, ICALP 2002 |
---|---|
Country | Spain |
City | Malaga |
Period | 7/8/02 → 7/13/02 |

### Keywords

- Histograms
- Streaming algorithms

### ASJC Scopus subject areas

- Theoretical Computer Science
- Computer Science (all)

### Cite this

**Histogramming data streams with fast per-item processing.** / Guha, Sudipto; Indyk, Piotr; Muthukrishnan, Shanmugavelayutham; Strauss, Martin J.

*Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings.* Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2380 LNCS, pp. 681-692, 29th International Colloquium on Automata, Languages, and Programming, ICALP 2002, Malaga, Spain, 7/8/02.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Histogramming data streams with fast per-item processing

AU - Guha, Sudipto

AU - Indyk, Piotr

AU - Muthukrishnan, Shanmugavelayutham

AU - Strauss, Martin J.

PY - 2002/12/1

Y1 - 2002/12/1

KW - Histograms

KW - Streaming algorithms

UR - http://www.scopus.com/inward/record.url?scp=84869198292&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84869198292&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84869198292

SN - 3540438645

SN - 9783540438649

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 681

EP - 692

BT - Automata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings

ER -