Abstract
In this paper, we consider the design and evaluation of a fault-tolerant multiprocessor with a rollback recovery mechanism. The rollback mechanism is based on the hardware recovery block, a hardware equivalent of the software recovery block. The hardware recovery blocks are constructed from consecutive state-save operations and several state-save units in every processor and memory module. Upon detection of a failure, the multiprocessor reconfigures itself to replace the faulty module, and the process originally assigned to the faulty module then retreats to one of the previously saved states in order to resume fault-free execution. Because of random interactions among cooperating processes and asynchrony in the state saves, the rollback of one process may propagate to others, so the need for multistep rollbacks may arise. In the worst case, when all the available saved states are exhausted, the processes have to restart from the beginning, as if they were executed in a system without any rollback recovery mechanism. A mathematical model is proposed to calculate both the coverage of multistep rollback recovery and the risk of restart. Also presented is an evaluation of the mean and variance of the execution time of a given task in the presence of rollbacks and/or restarts.
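As a rough illustration of the behaviour the abstract describes, the following Python sketch models a single process with a bounded number of state-save units: a multistep rollback succeeds while enough saved states remain, and the process restarts from the beginning once they are exhausted. This is a toy software model, not the paper's hardware design or its mathematical model. The class `RollbackProcess`, its methods, and the integer "state" are hypothetical names chosen only for this example.

```python
from collections import deque


class RollbackProcess:
    """Toy process that keeps only the most recent `num_save_units` states.

    Hypothetical illustration: the state is just an integer counting the
    units of work completed so far.
    """

    def __init__(self, num_save_units, initial_state=0):
        self.initial_state = initial_state
        self.state = initial_state
        # Bounded buffer: saving into a full buffer discards the oldest state.
        self.saved = deque(maxlen=num_save_units)
        self.restarts = 0

    def step(self, work=1):
        """Advance the computation by `work` units."""
        self.state += work

    def save_state(self):
        """Record the current state in one of the state-save units."""
        self.saved.append(self.state)

    def rollback(self, steps=1):
        """Retreat to the `steps`-th most recent saved state.

        Returns True if the rollback succeeded, or False if the saved
        states were exhausted and the process restarted from the beginning.
        """
        if steps < 1:
            raise ValueError("steps must be at least 1")
        if steps > len(self.saved):
            # Not enough saved states: fall back to a full restart.
            self.saved.clear()
            self.state = self.initial_state
            self.restarts += 1
            return False
        # Discard the newer saved states, then resume from the target one,
        # which stays in the buffer as a valid recovery point.
        for _ in range(steps - 1):
            self.saved.pop()
        self.state = self.saved[-1]
        return True
```

With two state-save units, for example, a process that has saved three states can still roll back one or two steps, but a request that reaches past the oldest retained state forces a restart. Rollback propagation between cooperating processes, which the abstract identifies as the cause of multistep rollbacks, is deliberately left out of this sketch.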
Original language | English (US) |
---|---|
Pages (from-to) | 113-124 |
Number of pages | 12 |
Journal | IEEE Transactions on Computers |
Volume | C-33 |
Issue number | 2 |
State | Published - Feb 1984 |
Externally published | Yes |
Keywords
- Fault-tolerant multiprocessor
- hardware/software recovery blocks
- performance of rollback recovery mechanisms
- rollback propagation
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics