Checkpointing exascale memory systems with existing memory technologies

Nilmini Abeyratne, Hsing Min Chen, Byoungchan Oh, Ronald Dreslinski, Chaitali Chakrabarti, Trevor Mudge

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

Building exascale supercomputers requires resilience to failing components such as processor, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming more challenging as both the amount of data to checkpoint and the number of components that can fail increase in exascale systems. To improve the speed of checkpointing, emerging non-volatile memories (phase change, magnetic, resistive RAM) have been proposed. However, using unproven memories to create checkpoints will only increase the design risk for exascale memory systems. In this paper, we show that exascale systems with hundreds of petabytes of memory can be constructed with commodity DRAM and SSD flash memory, and that newer non-volatile memories are unnecessary, at least for the next generation. The challenge when using commodity parts is providing fast and reliable checkpointing to protect against system failures. A straightforward solution of checkpointing to local flash-based SSD devices will not work because these devices are limited in both endurance and performance. We present a checkpointing solution that employs a combination of DRAM and SSD devices. A Checkpoint Location Controller (CLC) monitors the endurance of the SSD and the performance loss of the application, and decides dynamically whether to checkpoint to the DRAM or the SSD. The CLC improves SSD endurance and reduces application slowdown, but checkpoints held in DRAM are exposed to device failures. To design a reliable exascale memory, we protect the data with a low-latency ECC that can correct all errors due to bit/pin/column/word faults and also detect errors due to chip failures, and we protect the checkpoint with a Chipkill-Correct level ECC that allows reliable checkpointing to the DRAM. Using our system, the SSD lifetime increases by about 2×, from 3 years to 6.3 years. Furthermore, the CLC reduces the average checkpointing overhead by nearly 10× (from a 420% slowdown to 47%) compared to always checkpointing to the SSD.
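
The abstract does not spell out the CLC's decision rule, so the following Python fragment is only an illustrative sketch of the kind of policy it describes: track SSD wear and application slowdown, and fall back to DRAM when either is out of bounds. The class fields, the thresholds (`ssd_write_budget_bytes`, `target_lifetime_s`, `max_slowdown`), and the pro-rated write budget are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class CheckpointLocationController:
    """Hypothetical CLC policy; every field and threshold here is an assumption."""
    ssd_write_budget_bytes: float   # total bytes the SSD may absorb (assumed)
    target_lifetime_s: float        # lifetime the SSD should reach (assumed)
    max_slowdown: float             # tolerated overhead, e.g. 0.5 means 50% (assumed)
    ssd_bytes_written: float = 0.0
    compute_s: float = 0.0
    checkpoint_s: float = 0.0

    def choose(self, size_bytes: float) -> str:
        """Return "SSD" or "DRAM" for the next checkpoint of the given size."""
        elapsed_s = self.compute_s + self.checkpoint_s
        # Endurance check: writes so far must stay under a budget that grows
        # linearly with elapsed time (a simple pro-rating assumption).
        allowed = self.ssd_write_budget_bytes * elapsed_s / self.target_lifetime_s
        wearing_too_fast = self.ssd_bytes_written + size_bytes > allowed
        # Performance check: time spent checkpointing relative to useful compute.
        slowdown = self.checkpoint_s / self.compute_s if self.compute_s > 0 else 0.0
        too_slow = slowdown > self.max_slowdown
        return "DRAM" if (wearing_too_fast or too_slow) else "SSD"

    def record(self, location: str, size_bytes: float,
               compute_s: float, checkpoint_s: float) -> None:
        """Update the running counters after a checkpoint interval completes."""
        self.compute_s += compute_s
        self.checkpoint_s += checkpoint_s
        if location == "SSD":
            self.ssd_bytes_written += size_bytes
```

In this sketch a checkpoint goes to the SSD only while both the wear rate and the accumulated slowdown are within bounds. DRAM-resident checkpoints would still need the Chipkill-Correct level protection that the abstract describes.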

Original language: English (US)
Title of host publication: MEMSYS 2016 - Proceedings of the International Symposium on Memory Systems
Publisher: Association for Computing Machinery
Pages: 18-29
Number of pages: 12
ISBN (Electronic): 9781450343053
State: Published - Oct 3 2016
Event: 2nd International Symposium on Memory Systems, MEMSYS 2016 - Washington, United States
Duration: Oct 3 2016 → Oct 6 2016

Publication series

Name: ACM International Conference Proceeding Series
Volume: 03-06-October-2016

Other

Other: 2nd International Symposium on Memory Systems, MEMSYS 2016
Country/Territory: United States
City: Washington
Period: 10/3/16 → 10/6/16

Keywords

  • Checkpoint/restart
  • ECC
  • Exascale
  • Fault tolerance

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
