Improving availability in distributed systems with failure informers

Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, Michael Walfish

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper addresses a core question in distributed systems: how should applications be notified of failures? When a distributed system acts on failure reports, the system's correctness and availability depend on the granularity and semantics of those reports. The system's availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). This paper describes Pigeon, a failure reporting service designed to enable high availability in the applications that use it. Pigeon exposes a new abstraction, called a failure informer, which allows applications to take informed, application-specific recovery actions, and which encapsulates uncertainty, allowing applications to proceed safely in the presence of doubt. Pigeon also significantly improves over the previous state of the art in the three-way trade-off among coverage, accuracy, and timeliness.

Original language: English (US)
Title of host publication: Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013
Publisher: USENIX Association
Pages: 427-441
Number of pages: 15
ISBN (Electronic): 9781931971003
State: Published - Jan 1 2019
Event: 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013 - Lombard, United States
Duration: Apr 2 2013 to Apr 5 2013

Publication series

Name: Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013

Conference

Conference: 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013
Country: United States
City: Lombard
Period: 4/2/13 to 4/5/13

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Computer Networks and Communications

Cite this

Leners, J. B., Gupta, T., Aguilera, M. K., & Walfish, M. (2019). Improving availability in distributed systems with failure informers. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013 (pp. 427-441). (Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013). USENIX Association.