Simulating crash failures with many faulty processors

Rida Bazzi, Gil Neiger

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.

Original language: English (US)
Title of host publication: Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings
Publisher: Springer Verlag
Pages: 166-184
Number of pages: 19
Volume: 647 LNCS
ISBN (Print): 9783540561880
State: Published - 1992
Externally published: Yes
Event: 6th International Workshop on Distributed Algorithms, WDAG 1992 - Haifa, Israel
Duration: Nov 2 1992 to Nov 4 1992

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 647 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 6th International Workshop on Distributed Algorithms, WDAG 1992
Country: Israel
City: Haifa
Period: 11/2/92 to 11/4/92

Fingerprint

Crash
Fault tolerance
Distributed computer systems
Parallel algorithms
Communication
Distributed Algorithms
Distributed Computing
Fault-tolerant
Simplify
Hierarchy
Range of data

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Bazzi, R., & Neiger, G. (1992). Simulating crash failures with many faulty processors. In Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings (Vol. 647 LNCS, pp. 166-184). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 647 LNCS). Springer Verlag.

@inproceedings{d3683daeb02a43328e81c3b436c03daa,
title = "Simulating crash failures with many faulty processors",
author = "Rida Bazzi and Gil Neiger",
year = "1992",
language = "English (US)",
isbn = "9783540561880",
volume = "647 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "166--184",
booktitle = "Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings",
address = "Germany",

}

TY - GEN

T1 - Simulating crash failures with many faulty processors

AU - Bazzi, Rida

AU - Neiger, Gil

PY - 1992

Y1 - 1992

UR - http://www.scopus.com/inward/record.url?scp=0342365620&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0342365620&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9783540561880

VL - 647 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 166

EP - 184

BT - Distributed Algorithms - 6th International Workshop, WDAG 1992, Proceedings

PB - Springer Verlag

ER -