Abstract
Complex queries over high-speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream that have better properties than conventional random samples. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. We also show how to implement the operator in Gigascope, a high-speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within the enhanced Gigascope to perform subset-sum sampling, which is of great interest for IP network management. We evaluate this implementation on a live, high-speed Internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. To our knowledge, this is the first operational implementation of a wide variety of stream sampling algorithms at line speed within a data stream management system.
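To make the subset-sum sampling mentioned in the abstract more concrete, here is a minimal Python sketch of threshold sampling, one well-known way to sample weighted items so that subset sums can be estimated without bias. It is an illustration under stated assumptions, not the operator or the Gigascope implementation described in the paper: the function names, the `(key, weight)` record layout, and the fixed threshold `z` are all hypothetical.

```python
import random

def threshold_sample(items, z, rng=None):
    """Threshold sampling sketch (hypothetical helper, not the paper's operator).

    items: iterable of (key, weight) pairs, e.g. (flow id, byte count).
    z:     positive sampling threshold.
    Each item is kept with probability min(1, weight / z). A kept item is
    assigned the adjusted weight max(weight, z), so the expected adjusted
    weight of every item equals its true weight.
    """
    if z <= 0:
        raise ValueError("z must be positive")
    rng = rng or random.Random()
    sample = []
    for key, weight in items:
        # Items at or above the threshold are always kept.
        if weight >= z or rng.random() < weight / z:
            sample.append((key, max(weight, z)))
    return sample

def estimate_subset_sum(sample, predicate=lambda key: True):
    """Estimate the total weight of the items whose key satisfies predicate."""
    return sum(w for key, w in sample if predicate(key))
```

A larger `z` keeps fewer items and increases the variance of the estimate; choosing `z` no larger than the smallest weight keeps every item and makes the estimate exact.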
Original language | English (US) |
---|---|
Pages (from-to) | 1-12 |
Number of pages | 12 |
Journal | Proceedings of the ACM SIGMOD International Conference on Management of Data |
State | Published - Dec 1 2005 |
Event | SIGMOD 2005: ACM SIGMOD International Conference on Management of Data, Baltimore, MD, United States; Duration: Jun 14 2005 → Jun 16 2005 |
ASJC Scopus subject areas
- Software
- Information Systems
Cite this
Sampling algorithms in a stream operator. / Johnson, Theodore; Muthukrishnan, Shanmugavelayutham; Rozenbaum, Irina.
In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 01.12.2005, p. 1-12.
Research output: Contribution to journal › Conference article
TY - JOUR
T1 - Sampling algorithms in a stream operator
AU - Johnson, Theodore
AU - Muthukrishnan, Shanmugavelayutham
AU - Rozenbaum, Irina
PY - 2005/12/1
Y1 - 2005/12/1
UR - http://www.scopus.com/inward/record.url?scp=29844452412&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=29844452412&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:29844452412
SP - 1
EP - 12
JO - Proceedings of the ACM SIGMOD International Conference on Management of Data
JF - Proceedings of the ACM SIGMOD International Conference on Management of Data
SN - 0730-8078
ER -