Simulating crash failures with many faulty processors

Rida Bazzi; Gil Neiger

doi:10.1007/3-540-56188-9_12

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Scopus citations

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.

Original language	English (US)
Title of host publication	Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings
Editors	Adrian Segall, Shmuel Zaks
Publisher	Springer Verlag
Pages	166-184
Number of pages	19
ISBN (Print)	9783540561880
DOIs	https://doi.org/10.1007/3-540-56188-9_12
State	Published - 1992
Externally published	Yes
Event	6th International Workshop on Distributed Algorithms, WDAG 1992 - Haifa, Israel; Duration: Nov 2 1992 → Nov 4 1992

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	647 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	6th International Workshop on Distributed Algorithms, WDAG 1992
Country/Territory	Israel
City	Haifa
Period	11/2/92 → 11/4/92

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/3-540-56188-9_12

Cite this

Bazzi, R., & Neiger, G. (1992). Simulating crash failures with many faulty processors. In A. Segall, & S. Zaks (Eds.), Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings (pp. 166-184). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 647 LNCS). Springer Verlag. https://doi.org/10.1007/3-540-56188-9_12

Simulating crash failures with many faulty processors. / Bazzi, Rida; Neiger, Gil.
Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings. ed. / Adrian Segall; Shmuel Zaks. Springer Verlag, 1992. p. 166-184 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 647 LNCS).

Bazzi, R & Neiger, G 1992, Simulating crash failures with many faulty processors. in A Segall & S Zaks (eds), Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 647 LNCS, Springer Verlag, pp. 166-184, 6th International Workshop on Distributed Algorithms, WDAG 1992, Haifa, Israel, 11/2/92. https://doi.org/10.1007/3-540-56188-9_12

Bazzi R, Neiger G. Simulating crash failures with many faulty processors. In Segall A, Zaks S, editors, Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings. Springer Verlag. 1992. p. 166-184. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/3-540-56188-9_12

@inproceedings{d3683daeb02a43328e81c3b436c03daa,

title = "Simulating crash failures with many faulty processors",

abstract = "The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.",

author = "Rida Bazzi and Gil Neiger",

note = "Funding Information: Partial support for this work was provided by the National Science Foundation under grants CCR-8909663 and CCR-9106627. One of the authors was supported in part by a scholarship from the Hariri Foundation.; 6th International Workshop on Distributed Algorithms, WDAG 1992 ; Conference date: 02-11-1992 Through 04-11-1992",

year = "1992",

doi = "10.1007/3-540-56188-9_12",

language = "English (US)",

isbn = "9783540561880",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "166--184",

editor = "Adrian Segall and Shmuel Zaks",

booktitle = "Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings",

}

TY - GEN

T1 - Simulating crash failures with many faulty processors

AU - Bazzi, Rida

AU - Neiger, Gil

N1 - Funding Information: Partial support for this work was provided by the National Science Foundation under grants CCR-8909663 and CCR-9106627. One of the authors was supported in part by a scholarship from the Hariri Foundation.

PY - 1992

Y1 - 1992

N2 - The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.

UR - http://www.scopus.com/inward/record.url?scp=0342365620&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0342365620&partnerID=8YFLogxK

U2 - 10.1007/3-540-56188-9_12

DO - 10.1007/3-540-56188-9_12

M3 - Conference contribution

AN - SCOPUS:0342365620

SN - 9783540561880

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 166

EP - 184

BT - Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings

A2 - Segall, Adrian

A2 - Zaks, Shmuel

PB - Springer Verlag

T2 - 6th International Workshop on Distributed Algorithms, WDAG 1992

Y2 - 2 November 1992 through 4 November 1992

ER -
