Checkpointing exascale memory systems with existing memory technologies

Nilmini Abeyratne; Hsing Min Chen; Byoungchan Oh; Ronald Dreslinski; Chaitali Chakrabarti; Trevor Mudge

doi:10.1145/2989081.2989121

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

Building exascale supercomputers requires resilience to failing components such as processor, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming more challenging as the amount of data to checkpoint and the number of components that can fail increase in exascale systems. To improve the speed of checkpointing, emerging non-volatile memories (phase change, magnetic, resistive RAM) have been proposed. However, using unproven memories to create checkpoints will only increase the design risk for exascale memory systems. In this paper, we show that exascale systems with hundreds of petabytes of memory can be constructed with commodity DRAM and SSD flash memory and that newer non-volatile memories are unnecessary, at least for the next generation. The challenge when using commodity parts is providing fast and reliable checkpointing to protect against system failures. A straightforward solution of checkpointing to local flash-based SSD devices will not work because these devices are endurance- and performance-limited. We present a checkpointing solution that employs a combination of DRAM and SSD devices. A Checkpoint Location Controller (CLC) is implemented to monitor the endurance of the SSD and the performance loss of the application and to decide dynamically whether to checkpoint to the DRAM or the SSD. The CLC improves both SSD endurance and application slowdown, but the checkpoints in DRAM are exposed to device failures. To design a reliable exascale memory, we protect the data with a low-latency ECC that can correct all errors due to bit/pin/column/word faults and also detect errors due to chip failures, and we protect the checkpoint with a Chipkill-Correct level ECC that allows reliable checkpointing to the DRAM. Using our system, the SSD lifetime increases by 2×, from 3 years to 6.3 years. Furthermore, the CLC reduces the average checkpointing overhead by nearly 10× (to a 47% slowdown, down from 420%), compared with always checkpointing to the SSD.
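
The abstract says that the CLC monitors SSD endurance and application performance loss and chooses between DRAM and SSD for each checkpoint, but it does not give the decision rule. The following is only a minimal sketch of how such a controller might be structured, not the authors' implementation. The class name, the fields, the pro-rated write budget, and the slowdown threshold are all hypothetical assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckpointLocationController:
    """Illustrative CLC sketch; every field and threshold is an assumption."""

    ssd_write_budget_bytes: float  # assumed total bytes the SSD may absorb
    target_lifetime_s: float       # lifetime over which that budget should last
    max_slowdown: float            # tolerated ratio of checkpoint time to compute time
    ssd_bytes_written: float = 0.0
    compute_time_s: float = 0.0
    checkpoint_time_s: float = 0.0

    def record_compute(self, seconds: float) -> None:
        # Accumulate useful application time between checkpoints.
        self.compute_time_s += seconds

    def record_checkpoint(self, location: str, size_bytes: float, seconds: float) -> None:
        # Accumulate checkpoint cost; only SSD checkpoints consume endurance.
        self.checkpoint_time_s += seconds
        if location == "ssd":
            self.ssd_bytes_written += size_bytes

    def choose(self, size_bytes: float, elapsed_s: float, ssd_write_time_s: float) -> str:
        # Endurance check: stay within the write budget pro-rated to elapsed time.
        allowed = self.ssd_write_budget_bytes * (elapsed_s / self.target_lifetime_s)
        endurance_ok = self.ssd_bytes_written + size_bytes <= allowed

        # Performance check: projected slowdown if this checkpoint goes to the SSD.
        if self.compute_time_s > 0:
            projected = (self.checkpoint_time_s + ssd_write_time_s) / self.compute_time_s
            performance_ok = projected <= self.max_slowdown
        else:
            performance_ok = False

        # Fall back to DRAM whenever either limit would be exceeded.
        return "ssd" if endurance_ok and performance_ok else "dram"


# Hypothetical usage with made-up numbers.
clc = CheckpointLocationController(
    ssd_write_budget_bytes=1000.0, target_lifetime_s=100.0, max_slowdown=0.5
)
clc.record_compute(10.0)
first = clc.choose(size_bytes=50.0, elapsed_s=10.0, ssd_write_time_s=1.0)
clc.record_checkpoint(first, 50.0, 1.0)
second = clc.choose(size_bytes=60.0, elapsed_s=10.0, ssd_write_time_s=1.0)
```

In this sketch the SSD is chosen only when both limits hold, so DRAM absorbs checkpoints whenever SSD wear or slowdown would exceed its assumed bound. That is one plausible reading of "decide dynamically", and it is consistent with the abstract's point that DRAM-resident checkpoints then need stronger ECC protection.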

Original language	English (US)
Title of host publication	MEMSYS 2016 - Proceedings of the International Symposium on Memory Systems
Publisher	Association for Computing Machinery
Pages	18-29
Number of pages	12
ISBN (Electronic)	9781450343053
DOIs	https://doi.org/10.1145/2989081.2989121
State	Published - Oct 3 2016
Event	2nd International Symposium on Memory Systems, MEMSYS 2016 - Washington, United States
Duration	Oct 3 2016 → Oct 6 2016

Publication series

Name	ACM International Conference Proceeding Series
Volume	03-06-October-2016

Other

Other	2nd International Symposium on Memory Systems, MEMSYS 2016
Country/Territory	United States
City	Washington
Period	10/3/16 → 10/6/16

Keywords

Checkpoint/restart
ECC
Exascale
Fault tolerance

ASJC Scopus subject areas

Software
Human-Computer Interaction
Computer Vision and Pattern Recognition
Computer Networks and Communications

Cite this

Abeyratne, N., Chen, H. M., Oh, B., Dreslinski, R., Chakrabarti, C., & Mudge, T. (2016). Checkpointing exascale memory systems with existing memory technologies. In MEMSYS 2016 - Proceedings of the International Symposium on Memory Systems (pp. 18-29). (ACM International Conference Proceeding Series; Vol. 03-06-October-2016). Association for Computing Machinery. https://doi.org/10.1145/2989081.2989121
