Simplifying fault-tolerance: Providing the abstraction of crash failures

Rida Bazzi, Gil Neiger

Research output: Contribution to journal › Article

18 Citations (Scopus)

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate, especially for systems with synchronous message passing. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Such translations can be quantified by two measures: fault-tolerance, which is a measure of how many processors must remain correct for the translation to be correct, and round-complexity, which is a measure of how the translation increases the running time of an algorithm. Understanding these translations and their limitations with respect to these measures can provide insight into the relative impact of different models of faulty behavior on the ability to provide fault-tolerant applications for systems with synchronous message passing. This paper considers translations from crash failures to each of the following types of more severe failures: omission to send messages; omission to send and receive messages; and totally arbitrary behavior. It shows that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity. It exhibits a hierarchy of translations to general (send/receive) omission failures that improves upon the fault-tolerance of previously developed translations. These translations are optimal in that they cannot be improved with respect to one measure without negatively affecting the other; that is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. The paper also gives a hierarchy of translations to arbitrary failures that improves upon the round-complexity of previously developed translations. These translations are near-optimal; the hierarchy is matched by a hierarchy of impossibility results whose fault-tolerances differ from those of the translations only by small constants.
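
The failure models named in the abstract can be illustrated with a small simulation of one synchronous round. This is only a sketch of the standard definitions of these models, not the paper's translations: the names `Fault`, `run_round`, `drop_send`, `drop_recv`, and `forge` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    """Hypothetical description of how one faulty processor deviates in a round."""
    drop_send: set = field(default_factory=set)  # recipients whose message is omitted
    drop_recv: set = field(default_factory=set)  # senders whose message is ignored
    forge: dict = field(default_factory=dict)    # recipient -> forged value (arbitrary behavior)


def run_round(values, faults):
    """Simulate one synchronous round in which every processor sends its value to all.

    Returns received[q][p]: what processor q got from processor p,
    or None if no message arrived.
    """
    procs = sorted(values)
    received = {q: {} for q in procs}
    for p in procs:
        f = faults.get(p, Fault())
        for q in procs:
            if q in f.drop_send:
                received[q][p] = None  # omission to send
            else:
                received[q][p] = f.forge.get(q, values[p])  # forged only if arbitrary
    for q in procs:
        f = faults.get(q, Fault())
        for p in f.drop_recv:
            received[q][p] = None  # omission to receive
    return received


values = {0: "a", 1: "b", 2: "c"}

# Send omission: processor 0 fails to send to processor 2 only.
# A crash looks the same within its final round, but the processor
# then stays silent in every later round.
after_send_omission = run_round(values, {0: Fault(drop_send={2})})

# General omission: processor 0 also fails to receive from processor 1.
after_general_omission = run_round(values, {0: Fault(drop_send={2}, drop_recv={1})})

# Arbitrary behavior: processor 0 sends a forged value to processor 1.
after_arbitrary = run_round(values, {0: Fault(forge={1: "x"})})
```

Each model permits every behavior of the previous one plus something new, which is why an algorithm designed only for crash failures needs a translation before it can run under the more severe models.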

Original language: English (US)
Pages (from-to): 499-554
Number of pages: 56
Journal: Journal of the ACM
Volume: 48
Issue number: 3
DOIs: 10.1145/382780.382784
State: Published - May 2001

Fingerprint

Crash
Fault tolerance
Message passing
Parallel algorithms
Abstraction
Fault-tolerant
Arbitrary
Distributed Algorithms
Hierarchy
Simplify

Keywords

  • Crash failures
  • Fault-tolerance
  • Translations

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Computer Graphics and Computer-Aided Design
  • Software
  • Theoretical Computer Science
  • Computational Theory and Mathematics

Cite this

Simplifying fault-tolerance: Providing the abstraction of crash failures. / Bazzi, Rida; Neiger, Gil.

In: Journal of the ACM, Vol. 48, No. 3, 05.2001, p. 499-554.

@article{a9d87d96f4bc4536871b019c646b0f44,
title = "Simplifying fault-tolerance: Providing the abstraction of crash failures",
abstract = "The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate, especially for systems with synchronous message passing. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Such translations can be quantified by two measures: fault-tolerance, which is a measure of how many processors must remain correct for the translation to be correct, and round-complexity, which is a measure of how the translation increases the running time of an algorithm. Understanding these translations and their limitations with respect to these measures can provide insight into the relative impact of different models of faulty behavior on the ability to provide fault-tolerant applications for systems with synchronous message passing. This paper considers translations from crash failures to each of the following types of more severe failures: omission to send messages; omission to send and receive messages; and totally arbitrary behavior. It shows that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity. It exhibits a hierarchy of translations to general (send/receive) omission failures that improves upon the fault-tolerance of previously developed translations. These translations are optimal in that they cannot be improved with respect to one measure without negatively affecting the other; that is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. The paper also gives a hierarchy of translations to arbitrary failures that improves upon the round-complexity of previously developed translations. These translations are near-optimal; the hierarchy is matched by a hierarchy of impossibility results whose fault-tolerances differ from those of the translations only by small constants.",
keywords = "Crash failures, Fault-tolerance, Translations",
author = "Rida Bazzi and Gil Neiger",
year = "2001",
month = "5",
doi = "10.1145/382780.382784",
language = "English (US)",
volume = "48",
pages = "499--554",
journal = "Journal of the ACM",
issn = "0004-5411",
publisher = "Association for Computing Machinery (ACM)",
number = "3",

}

TY - JOUR

T1 - Simplifying fault-tolerance

T2 - Providing the abstraction of crash failures

AU - Bazzi, Rida

AU - Neiger, Gil

PY - 2001/5

Y1 - 2001/5

N2 - The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate, especially for systems with synchronous message passing. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Such translations can be quantified by two measures: fault-tolerance, which is a measure of how many processors must remain correct for the translation to be correct, and round-complexity, which is a measure of how the translation increases the running time of an algorithm. Understanding these translations and their limitations with respect to these measures can provide insight into the relative impact of different models of faulty behavior on the ability to provide fault-tolerant applications for systems with synchronous message passing. This paper considers translations from crash failures to each of the following types of more severe failures: omission to send messages; omission to send and receive messages; and totally arbitrary behavior. It shows that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity. It exhibits a hierarchy of translations to general (send/receive) omission failures that improves upon the fault-tolerance of previously developed translations. These translations are optimal in that they cannot be improved with respect to one measure without negatively affecting the other; that is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. The paper also gives a hierarchy of translations to arbitrary failures that improves upon the round-complexity of previously developed translations. These translations are near-optimal; the hierarchy is matched by a hierarchy of impossibility results whose fault-tolerances differ from those of the translations only by small constants.

AB - The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate, especially for systems with synchronous message passing. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Such translations can be quantified by two measures: fault-tolerance, which is a measure of how many processors must remain correct for the translation to be correct, and round-complexity, which is a measure of how the translation increases the running time of an algorithm. Understanding these translations and their limitations with respect to these measures can provide insight into the relative impact of different models of faulty behavior on the ability to provide fault-tolerant applications for systems with synchronous message passing. This paper considers translations from crash failures to each of the following types of more severe failures: omission to send messages; omission to send and receive messages; and totally arbitrary behavior. It shows that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity. It exhibits a hierarchy of translations to general (send/receive) omission failures that improves upon the fault-tolerance of previously developed translations. These translations are optimal in that they cannot be improved with respect to one measure without negatively affecting the other; that is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. The paper also gives a hierarchy of translations to arbitrary failures that improves upon the round-complexity of previously developed translations. These translations are near-optimal; the hierarchy is matched by a hierarchy of impossibility results whose fault-tolerances differ from those of the translations only by small constants.

KW - Crash failures

KW - Fault-tolerance

KW - Translations

UR - http://www.scopus.com/inward/record.url?scp=0141441473&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0141441473&partnerID=8YFLogxK

U2 - 10.1145/382780.382784

DO - 10.1145/382780.382784

M3 - Article

AN - SCOPUS:0141441473

VL - 48

SP - 499

EP - 554

JO - Journal of the ACM

JF - Journal of the ACM

SN - 0004-5411

IS - 3

ER -