Abstract

Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low overhead schemes for improving the reliability of commodity DRAM systems with better power and IPC performance compared to Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and do not require any change in the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely, RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and has stronger reliability compared to Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can correct an additional random error in a second chip. Scheme 2 trades off reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors due to a chip failure but can detect them with 99.9986% probability, and once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results at the 28nm node show that the RS(36,32) code results in a very low decoding latency that can be well-hidden in commodity memory systems and, therefore, it has minimal effect on the DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2% (maximum of 13.8%) and 4.8% (maximum of 31.8%) and reduce the power consumption by an average of 16.2% (maximum of 25%) and 26.8% (maximum of 36%), respectively.
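
The correction claims in the abstract can be checked against the standard Reed-Solomon bound, under which a code with n - k check symbols decodes any pattern satisfying 2 x (symbol errors) + (erasures) <= n - k. The sketch below is illustrative arithmetic only, not the authors' decoder. It assumes the 36 eight-bit symbols of a codeword are spread evenly over the activated chips (2 symbols per chip for Scheme 1, 4 for Scheme 2), which the abstract does not state; the function names are hypothetical.

```python
# Illustrative arithmetic for RS(36,32) over GF(2^8); not the paper's implementation.
N, K = 36, 32        # codeword length and data length, in 8-bit symbols
PARITY = N - K       # 4 check symbols


def storage_overhead(n=N, k=K):
    """Check symbols as a fraction of data symbols."""
    return (n - k) / k


def correctable(errors, erasures, parity=PARITY):
    """Standard RS bound: 2 * errors + erasures <= n - k."""
    return 2 * errors + erasures <= parity


def symbols_per_chip(chips, n=N):
    """ASSUMPTION: codeword symbols are divided evenly among the active chips."""
    return n // chips


s1 = symbols_per_chip(18)   # Scheme 1: 18 chips per access
s2 = symbols_per_chip(9)    # Scheme 2: 9 chips per access

print("storage overhead:", storage_overhead())
print("Scheme 1, unknown faulty chip:", correctable(errors=s1, erasures=0))
print("Scheme 1, known faulty chip + 1 random symbol error:",
      correctable(errors=1, erasures=s1))
print("Scheme 2, unknown faulty chip:", correctable(errors=s2, erasures=0))
print("Scheme 2, chip marked faulty:", correctable(errors=0, erasures=s2))
```

Under this assumed layout, an unlocated chip failure in Scheme 2 corrupts four symbols and exceeds the correction bound, whereas the same four symbols treated as erasures (once the chip is marked faulty) fall within it, which is consistent with the behavior the abstract describes.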

Original language: English (US)
Title of host publication: ACM International Conference Proceeding Series
Publisher: Association for Computing Machinery
Pages: 60-70
Number of pages: 11
Volume: 05-08-October-2015
ISBN (Print): 9781450336048
DOIs: https://doi.org/10.1145/2818950.2818961
State: Published - Oct 5 2015
Event: 1st International Symposium on Memory Systems, MEMSYS 2015 - Washington, United States
Duration: Aug 14 2015 to Aug 15 2015

Fingerprint

Dynamic random access storage
Error correction
Computer systems
Random errors
Data storage equipment
Memory architecture
Program processors
Energy efficiency
Decoding
Electric power utilization
Servers

Keywords

  • Chipkill-Correct
  • DRAM errors
  • DRAM memory system
  • Erasure and error correction
  • Reliability

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Chen, H. M., Arunkumar, A., Wu, C-J., Mudge, T., & Chakrabarti, C. (2015). E-ECC: Low power erasure and error correction schemes for increasing reliability of commodity DRAM systems. In ACM International Conference Proceeding Series (Vol. 05-08-October-2015, pp. 60-70). Association for Computing Machinery. https://doi.org/10.1145/2818950.2818961

E-ECC: Low power erasure and error correction schemes for increasing reliability of commodity DRAM systems. / Chen, Hsing Min; Arunkumar, Akhil; Wu, Carole-Jean; Mudge, Trevor; Chakrabarti, Chaitali.

ACM International Conference Proceeding Series. Vol. 05-08-October-2015. Association for Computing Machinery, 2015. p. 60-70.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

@inproceedings{04d38f543a684970ad128fc3a36c780c,
title = "E-ECC: Low power erasure and error correction schemes for increasing reliability of commodity DRAM systems",
abstract = "Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low overhead schemes for improving the reliability of commodity DRAM systems with better power and IPC performance compared to Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5{\%} storage overhead and do not require any change in the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely, RS(36,32) over GF($2^8$). Scheme 1 activates 18 chips per access and has stronger reliability compared to Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can correct an additional random error in a second chip. Scheme 2 trades off reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors due to a chip failure but can detect them with 99.9986{\%} probability, and once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results at the 28nm node show that the RS(36,32) code results in a very low decoding latency that can be well-hidden in commodity memory systems and, therefore, it has minimal effect on the DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2{\%} (maximum of 13.8{\%}) and 4.8{\%} (maximum of 31.8{\%}) and reduce the power consumption by an average of 16.2{\%} (maximum of 25{\%}) and 26.8{\%} (maximum of 36{\%}), respectively.",
keywords = "Chipkill-Correct, DRAM errors, DRAM memory system, Erasure and error correction, Reliability",
author = "Chen, {Hsing Min} and Akhil Arunkumar and Carole-Jean Wu and Trevor Mudge and Chaitali Chakrabarti",
year = "2015",
month = "10",
day = "5",
doi = "10.1145/2818950.2818961",
language = "English (US)",
isbn = "9781450336048",
volume = "05-08-October-2015",
pages = "60--70",
booktitle = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - E-ECC

T2 - Low power erasure and error correction schemes for increasing reliability of commodity DRAM systems

AU - Chen, Hsing Min

AU - Arunkumar, Akhil

AU - Wu, Carole-Jean

AU - Mudge, Trevor

AU - Chakrabarti, Chaitali

PY - 2015/10/5

Y1 - 2015/10/5

AB - Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low overhead schemes for improving the reliability of commodity DRAM systems with better power and IPC performance compared to Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and do not require any change in the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely, RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and has stronger reliability compared to Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can correct an additional random error in a second chip. Scheme 2 trades off reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors due to a chip failure but can detect them with 99.9986% probability, and once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results at the 28nm node show that the RS(36,32) code results in a very low decoding latency that can be well-hidden in commodity memory systems and, therefore, it has minimal effect on the DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2% (maximum of 13.8%) and 4.8% (maximum of 31.8%) and reduce the power consumption by an average of 16.2% (maximum of 25%) and 26.8% (maximum of 36%), respectively.

KW - Chipkill-Correct

KW - DRAM errors

KW - DRAM memory system

KW - Erasure and error correction

KW - Reliability

UR - http://www.scopus.com/inward/record.url?scp=84959325256&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959325256&partnerID=8YFLogxK

U2 - 10.1145/2818950.2818961

DO - 10.1145/2818950.2818961

M3 - Conference contribution

SN - 9781450336048

VL - 05-08-October-2015

SP - 60

EP - 70

BT - ACM International Conference Proceeding Series

PB - Association for Computing Machinery

ER -