Drizzle: Fast and Adaptable Stream Processing at Scale

Shivaram Venkataraman, Michael Armbrust, Aurojit Panda, Ali Ghodsi, Kay Ousterhout, Michael J. Franklin, Benjamin Recht, Ion Stoica

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Large-scale streaming systems aim to provide high throughput and low latency. They are often used to run mission-critical applications, and must be available 24x7. Thus such systems need to adapt to failures and inherent changes in workloads, with minimal impact on latency and throughput. Unfortunately, existing solutions require operators to choose between achieving low latency during normal operation and incurring minimal impact during adaptation. Continuous operator streaming systems, such as Naiad and Flink, provide low latency during normal execution but incur high overheads during adaptation (e.g., recovery), while micro-batch systems, such as Spark Streaming and FlumeJava, adapt rapidly at the cost of high latency during normal operations. Our key observation is that while streaming workloads require millisecond-level processing, workload and cluster properties change less frequently. Based on this, we develop Drizzle, a system that decouples the processing interval from the coordination interval used for fault tolerance and adaptability. Our experiments on a 128-node EC2 cluster show that on the Yahoo Streaming Benchmark, Drizzle can achieve end-to-end record processing latencies of less than 100ms and can get 2–3x lower latency than Spark. Drizzle also exhibits better adaptability, and can recover from failures 4x faster than Flink while having up to 13x lower latency during recovery.
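The abstract's central idea — decoupling the processing interval from the coordination interval — can be illustrated with a small simulation. This is a hypothetical sketch, not Drizzle's actual API: `run_groups`, `process`, and `coordinate` are illustrative names. Micro-batches are processed back-to-back, but centralized coordination (checkpointing, scheduling decisions) happens only once per group of batches, amortizing its cost without slowing each batch.

```python
# Hypothetical sketch (not Drizzle's real interface): micro-batches run
# continuously, but coordination fires only at group boundaries, so its
# overhead is amortized over GROUP_SIZE batches.

GROUP_SIZE = 20  # micro-batches per coordination round


def run_groups(batches, process, coordinate, group_size=GROUP_SIZE):
    """Process `batches` in order; invoke `coordinate` only at group
    boundaries. Returns per-batch results and the coordination count."""
    results = []
    coordinations = 0
    for i, batch in enumerate(batches):
        results.append(process(batch))          # per-batch work (fast path)
        if (i + 1) % group_size == 0:           # group boundary (slow path)
            coordinate()                        # e.g., checkpoint + re-plan
            coordinations += 1
    return results, coordinations


# Example: 100 tiny batches, coordination every 20 batches -> 5 rounds.
batches = [list(range(i, i + 3)) for i in range(100)]
checkpoints = {"count": 0}


def coordinate():
    checkpoints["count"] += 1


out, rounds = run_groups(batches, process=sum, coordinate=coordinate)
# rounds == 5: coordination cost is paid 5 times, not 100 times.
```

Shrinking `group_size` makes the system react to failures and load changes sooner; growing it reduces scheduling overhead per batch — the same trade-off the paper navigates by choosing the two intervals independently.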

Original language: English (US)
Title of host publication: SOSP 2017 - Proceedings of the 26th ACM Symposium on Operating Systems Principles
Publisher: Association for Computing Machinery, Inc
Pages: 374-389
Number of pages: 16
ISBN (Electronic): 9781450350853
DOI: 10.1145/3132747.3132750
State: Published - Oct 14 2017
Event: 26th ACM Symposium on Operating Systems Principles, SOSP 2017 - Shanghai, China
Duration: Oct 28 2017 - Oct 31 2017

Keywords

  • Performance
  • Reliability
  • Stream Processing

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

Venkataraman, S., Armbrust, M., Panda, A., Ghodsi, A., Ousterhout, K., Franklin, M. J., ... Stoica, I. (2017). Drizzle: Fast and Adaptable Stream Processing at Scale. In SOSP 2017 - Proceedings of the 26th ACM Symposium on Operating Systems Principles (pp. 374-389). Association for Computing Machinery, Inc. https://doi.org/10.1145/3132747.3132750
