Continuous sampling from distributed streams

Graham Cormode, S. Muthukrishnan, Ke Yi, Qin Zhang

Research output: Contribution to journalArticle

Abstract

A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this article, we present communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent elements, or arrivals within the last w time units. We show that our protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.

Original languageEnglish (US)
Article number2160163
JournalJournal of the ACM
Volume59
Issue number2
DOIs
StatePublished - Apr 1 2012
Externally publishedYes

Keywords

  • Distributed tracking
  • Random sampling

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Information Systems
  • Hardware and Architecture
  • Artificial Intelligence

Fingerprint Dive into the research topics of 'Continuous sampling from distributed streams'. Together they form a unique fingerprint.

  • Cite this

    Cormode, G., Muthukrishnan, S., Yi, K., & Zhang, Q. (2012). Continuous sampling from distributed streams. Journal of the ACM, 59(2), [2160163]. https://doi.org/10.1145/2160158.2160163