Revisiting symptom-based fault tolerant techniques against soft errors

Hwisoo So, Moslem Didehban, Yohan Ko, Reiley Jeyapaul, Jongho Kim, Youngbin Kim, Kyoungwoo Lee, Aviral Shrivastava

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Aggressive technology scaling and near-threshold computing have made soft error reliability one of the leading design considerations in modern embedded microprocessors. Although traditional hardware/software redundancy-based schemes can provide a high level of protection, they incur significant overheads in terms of performance and hardware resources. The considerable overheads of such full redundancy-based techniques have motivated researchers to propose low-cost soft error protection schemes, such as symptom-based error protection schemes. The main idea behind a symptom-based error protection scheme is that soft errors in the system will quickly generate some symptoms, such as exceptions, branch mispredictions, cache or TLB misses, or unpredictable variable values. Therefore, monitoring such infrequent symptoms makes it possible to cover the manifestation of failures caused by soft errors. Symptom-based protection schemes have been suggested as shortcuts to achieve acceptable reliability with comparable overheads. Because symptom-based protection schemes seem attractive due to their generality and simplicity, even state-of-the-art protection schemes use them as baseline protections. However, our detailed analysis of the fault coverage and performance overheads of such schemes reveals that the user-visible failure coverage, particularly of ReStore, is limited (29% on average). By contrast, the runtime overheads are significant (40% on average) because the majority of the fault injection experiments that were counted as failures detected or recovered through low-level symptoms are actually benign faults, masked at the program level.

Original language: English (US)
Article number: 3028
Journal: Electronics (Switzerland)
Volume: 10
Issue number: 23
State: Published - Dec 1 2021

Keywords

  • Embedded systems
  • Fault tolerance
  • Protection technique
  • Soft error
  • Symptoms
  • Transient fault

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Signal Processing
  • Hardware and Architecture
  • Computer Networks and Communications
  • Electrical and Electronic Engineering
