Combining tentative and definite executions for very fast dependable parallel computing

Zvi Kedem, K. V. Palem, A. Raghunathan, P. G. Spirakis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present a general and efficient strategy for computing robustly on unreliable parallel machines. The model of a parallel machine that we use is a CRCW PRAM with dynamic resource fluctuations: processors can fail during the computation, and may possibly be restored later. We first introduce the notions of definite and tentative algorithms for executing a single parallel step of an ideal parallel machine on the unreliable machine. A definite algorithm is one that guarantees a correct execution of a step, while a tentative algorithm is one that is "highly likely" to produce a correct execution of a step on the unreliable machine. We show that any definite execution of one step requires Ω(log n) time on an n-processor unreliable machine, even if all the processors functioned perfectly. This implies an Ω(log n) slowdown for executing any non-trivial program on the unreliable machine, provided only definite executions are used. We get around this overhead by combining tentative and definite execution schemes appropriately, to derive correct and efficient robust executions for arbitrary PRAM programs, with expected amortized slowdown of only O(1) for a variety of reasonable failure models. We achieve this by using a tentative algorithm to execute each of the program's steps, while using a definite algorithm to audit the execution at selected points. If the audit does not certify the execution as correct, then the execution is rolled back to a previous audit point and restarted from there. In contrast to this work, all previous results required a slowdown of Ω(log n), since they used definite algorithms only.
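
The audit-and-rollback scheme described in the abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the names `tentative_step`, `robust_run`, `ideal_run`, and `neighbor_sum` are hypothetical, the failure model (each cell's write is lost independently with probability `fail_prob`) is invented for illustration, the audit is modeled by a boolean flag rather than by a real definite algorithm, and the cost model (1 time unit per tentative step, ⌈log2 n⌉ units per audit) is an assumption loosely motivated by the Ω(log n) bound quoted above.

```python
import math
import random


def neighbor_sum(i, state):
    """Example step: cell i reads itself and its right neighbor (cyclically)."""
    return state[i] + state[(i + 1) % len(state)]


def ideal_run(program, initial):
    """Reference execution on a fault-free machine: every write succeeds."""
    state = list(initial)
    for step in program:
        state = [step(i, state) for i in range(len(state))]
    return state


def tentative_step(step, state, fail_prob, rng):
    """Apply one synchronous step; each write is lost with probability fail_prob.

    Returns (new_state, all_writes_done). The failure model is an assumption.
    """
    new_state, all_done = [], True
    for i in range(len(state)):
        if rng.random() < fail_prob:
            new_state.append(state[i])  # write lost: the cell keeps its stale value
            all_done = False
        else:
            new_state.append(step(i, state))
    return new_state, all_done


def robust_run(program, initial, fail_prob, audit_every, seed=0):
    """Execute steps tentatively, audit every `audit_every` steps, roll back on failure.

    Assumes len(initial) >= 1. Returns (final_state, total_cost) under the
    assumed cost model: 1 unit per tentative step, ceil(log2 n) units per audit.
    """
    rng = random.Random(seed)
    audit_cost = max(1, math.ceil(math.log2(len(initial))))
    checkpoint, checkpoint_pc = list(initial), 0  # last certified audit point
    state, pc, clean, cost = list(initial), 0, True, 0
    while pc < len(program):
        state, ok = tentative_step(program[pc], state, fail_prob, rng)
        clean = clean and ok
        pc += 1
        cost += 1
        if pc % audit_every == 0 or pc == len(program):
            cost += audit_cost  # the audit stands in for a definite algorithm
            if clean:
                checkpoint, checkpoint_pc = list(state), pc  # certify and advance
            else:
                # Audit failed: discard work since the last certified point.
                state, pc, clean = list(checkpoint), checkpoint_pc, True
    return state, cost


if __name__ == "__main__":
    program = [neighbor_sum] * 12
    initial = list(range(8))
    final, cost = robust_run(program, initial, fail_prob=0.02, audit_every=3, seed=1)
    print(final == ideal_run(program, initial), cost)
```

With `fail_prob=0.0`, 12 steps on 8 cells, and an audit every 3 steps, the sketch charges 12 + 4·3 = 24 time units. How the paper selects audit points to reach an expected amortized slowdown of O(1) is not stated in the abstract and is not modeled here.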

Original language: English (US)
Title of host publication: Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, STOC 1991
Publisher: Association for Computing Machinery
Pages: 381-390
Number of pages: 10
Volume: Part F130073
ISBN (Electronic): 0897913973
State: Published - Jan 3 1991
Event: 23rd Annual ACM Symposium on Theory of Computing, STOC 1991 - New Orleans, United States
Duration: May 5 1991 to May 8 1991

Other

Other: 23rd Annual ACM Symposium on Theory of Computing, STOC 1991
Country: United States
City: New Orleans
Period: 5/5/91 to 5/8/91

Fingerprint

Parallel processing systems

ASJC Scopus subject areas

  • Software

Cite this

Kedem, Z., Palem, K. V., Raghunathan, A., & Spirakis, P. G. (1991). Combining tentative and definite executions for very fast dependable parallel computing. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, STOC 1991 (Vol. Part F130073, pp. 381-390). Association for Computing Machinery.

@inproceedings{8a34dd7879d7491782b89cdbfd507248,
title = "Combining tentative and definite executions for very fast dependable parallel computing",
abstract = "We present a general and efficient strategy for computing robustly on unreliable parallel machines. The model of a parallel machine that we use is a CRCW PRAM with dynamic resource fluctuations: processors can fail during the computation, and may possibly be restored later. We first introduce the notions of definite and tentative algorithms for executing a single parallel step of an ideal parallel machine on the unreliable machine. A definite algorithm is one that guarantees a correct execution of a step, while a tentative algorithm is one that is {"}highly likely{"} to produce a correct execution of a step on the unreliable machine. We show that any definite execution of one step requires Ω(log n) time on an n-processor unreliable machine, even if all the processors functioned perfectly. This implies an Ω(log n) slowdown for executing any non-trivial program on the unreliable machine, provided only definite executions are used. We get around this overhead by combining tentative and definite execution schemes appropriately, to derive correct and efficient robust executions for arbitrary PRAM programs, with expected amortized slowdown of only O(1) for a variety of reasonable failure models. We achieve this by using a tentative algorithm to execute each of the program's steps, while using a definite algorithm to audit the execution at selected points. If the audit does not certify the execution as correct, then the execution is rolled back to a previous audit point and restarted from there. In contrast to this work, all previous results required a slowdown of Ω(log n), since they used definite algorithms only.",
author = "Zvi Kedem and Palem, {K. V.} and A. Raghunathan and Spirakis, {P. G.}",
year = "1991",
month = "1",
day = "3",
language = "English (US)",
volume = "Part F130073",
pages = "381--390",
booktitle = "Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, STOC 1991",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Combining tentative and definite executions for very fast dependable parallel computing

AU - Kedem, Zvi

AU - Palem, K. V.

AU - Raghunathan, A.

AU - Spirakis, P. G.

PY - 1991/1/3

Y1 - 1991/1/3

N2 - We present a general and efficient strategy for computing robustly on unreliable parallel machines. The model of a parallel machine that we use is a CRCW PRAM with dynamic resource fluctuations: processors can fail during the computation, and may possibly be restored later. We first introduce the notions of definite and tentative algorithms for executing a single parallel step of an ideal parallel machine on the unreliable machine. A definite algorithm is one that guarantees a correct execution of a step, while a tentative algorithm is one that is "highly likely" to produce a correct execution of a step on the unreliable machine. We show that any definite execution of one step requires Ω(log n) time on an n-processor unreliable machine, even if all the processors functioned perfectly. This implies an Ω(log n) slowdown for executing any non-trivial program on the unreliable machine, provided only definite executions are used. We get around this overhead by combining tentative and definite execution schemes appropriately, to derive correct and efficient robust executions for arbitrary PRAM programs, with expected amortized slowdown of only O(1) for a variety of reasonable failure models. We achieve this by using a tentative algorithm to execute each of the program's steps, while using a definite algorithm to audit the execution at selected points. If the audit does not certify the execution as correct, then the execution is rolled back to a previous audit point and restarted from there. In contrast to this work, all previous results required a slowdown of Ω(log n), since they used definite algorithms only.

UR - http://www.scopus.com/inward/record.url?scp=84990703309&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84990703309&partnerID=8YFLogxK

M3 - Conference contribution

VL - Part F130073

SP - 381

EP - 390

BT - Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, STOC 1991

PB - Association for Computing Machinery

ER -