Design and Evaluation of a Fault-Tolerant Multiprocessor Using Hardware Recovery Blocks

Yann Hang Lee, Kang G. Shin

Research output: Contribution to journal (Article)

25 Scopus citations

Abstract

In this paper we consider the design and evaluation of a fault-tolerant multiprocessor with a rollback recovery mechanism. The rollback mechanism is based on the hardware recovery block, which is a hardware equivalent of the software recovery block. The hardware recovery blocks are constructed by consecutive state-save operations and several state-save units in every processor and memory module. Upon detection of failure, the multiprocessor reconfigures itself to replace the faulty module, and then the process originally assigned to the faulty module retreats to one of the previously saved states in order to resume fault-free execution. Due to random interactions among cooperating processes and also due to asynchrony in the state-savings, the rollback of a process may propagate to others, and thus the need for multiple-step rollbacks may arise. In the worst case, when all the available saved states are exhausted, the processes have to restart from the beginning as if they were executed in a system without any rollback recovery mechanism. A mathematical model is proposed to calculate both the coverage of multistep rollback recovery and the risk of restart. Also presented is the evaluation of the mean and variance of the execution time of a given task with the occurrence of rollbacks and/or restarts.

Original language: English (US)
Pages (from-to): 113-124
Number of pages: 12
Journal: IEEE Transactions on Computers
Volume: C-33
Issue number: 2
State: Published - Feb 1984

Keywords

  • Fault-tolerant multiprocessor
  • hardware/software recovery blocks
  • performance of rollback recovery mechanisms
  • rollback propagation

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
