Optimal design and use of retry in fault-tolerant computer systems

Yann Heng Lee, Kang G. Shin

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

In this paper, a new method is presented for (i) determining an optimal retry policy and (ii) using retry for fault characterization, which is defined as classification of the fault type and determination of fault durations. First, an optimal retry policy is derived for a given fault characteristic, which determines the maximum allowable retry durations so as to minimize the total task completion time. Then, the combined fault characterization and retry decision, in which the characteristic of a fault is estimated simultaneously with the determination of the optimal retry policy, is carried out. Two solution approaches are developed: one is based on point estimation and the other on Bayes sequential decision analysis. Numerical examples are presented in which all the durations associated with faults (i.e., active, benign, and interfailure durations) have monotone hazard rate functions (e.g., exponential, Weibull, and gamma distributions). These are standard distributions commonly used for modeling and analysis of faults.
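
To make the idea of a maximum allowable retry duration concrete, the following is a minimal sketch under a deliberately simplified model that is an assumption of this example, not the model used in the paper. A fault is transient with probability p, in which case its active duration is exponential with rate mu; otherwise it is permanent. Retry runs for at most r time units, and if the fault is still active at r the task pays a fixed restart cost. The names expected_cost, optimal_retry, and restart_cost are hypothetical.

```python
import math


def expected_cost(r, p, mu, restart_cost):
    """Expected time lost to one fault when retrying for at most r time units.

    Assumed model (illustrative only): transient with probability p and
    Exp(mu) active duration, otherwise permanent.
    """
    # Transient fault that clears before r: only its active duration is lost.
    cleared = p * (1.0 - math.exp(-mu * r) * (1.0 + mu * r)) / mu
    # Fault still active at r (long transient or permanent):
    # the whole retry window is lost, plus the restart cost.
    p_still_active = p * math.exp(-mu * r) + (1.0 - p)
    return cleared + p_still_active * (r + restart_cost)


def optimal_retry(p, mu, restart_cost):
    """Retry duration minimizing expected_cost under the assumed model.

    Setting the derivative p*exp(-mu*r)*(1 - mu*restart_cost) + (1 - p)
    to zero gives a closed form; when no positive root exists the cost
    is increasing in r, so the best choice is not to retry at all.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    gain = mu * restart_cost - 1.0
    if gain <= 0.0:
        return 0.0
    ratio = p * gain / (1.0 - p)
    return math.log(ratio) / mu if ratio > 1.0 else 0.0


if __name__ == "__main__":
    r_star = optimal_retry(p=0.9, mu=1.0, restart_cost=5.0)
    print("retry for at most", r_star, "time units")
    print("expected loss:", expected_cost(r_star, 0.9, 1.0, 5.0))
```

In this sketch, retry is worthwhile only when the restart cost exceeds the mean transient duration 1/mu and transient faults are likely enough. The paper treats the more general case of monotone hazard rates and fault characteristics that must be estimated, which this example does not attempt.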

Original language: English (US)
Pages (from-to): 45-69
Number of pages: 25
Journal: Journal of the ACM (JACM)
Volume: 35
Issue number: 1
State: Published - Jan 1 1988
Externally published: Yes

Keywords

  • Bayes decision problem
  • estimation
  • fault characteristic
  • hypothesis testing
  • optimal retry

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Information Systems
  • Hardware and Architecture
  • Artificial Intelligence
