Abstract

Most server-grade systems provide Chipkill-Correct error protection at the expense of power and performance. In this paper, we present a low-overhead solution for improving the reliability of commodity DRAM systems with no change to the existing memory architecture. Specifically, we propose five erasure and error correction (E-ECC) schemes that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. Synthesis results at the 28 nm node show that the decoding latency of these codes is negligible compared to the DRAM access latency. In addition, we make use of erasure codes to extend the lifetime of the DRAM systems. Specifically, once a chip is marked faulty due to persistent errors, all E-ECC schemes correct erasures due to that faulty chip and also correct an additional random error in a second chip. Evaluation with SPEC2006 workloads shows that, compared to x4 Chipkill-Correct schemes, Scheme 5 has the highest IPC improvement (mean of 7 percent) and Scheme 4 has the largest power reduction (mean of 18 percent) and the largest increase in energy efficiency (mean of 25 percent).
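
The erasure-plus-error claim in the abstract can be related to the classical bounded-distance condition from coding theory: a code with minimum distance d can correct e erasures together with t errors whenever e + 2t <= d - 1. The following is a minimal sketch of that arithmetic only, not the paper's decoder. It assumes, purely for illustration, that each DRAM chip contributes exactly one code symbol; the function names are hypothetical and the abstract does not state the actual symbol-to-chip mapping of the five schemes.

```python
def min_distance_needed(erasures: int, errors: int) -> int:
    """Smallest minimum distance d satisfying erasures + 2*errors <= d - 1."""
    if erasures < 0 or errors < 0:
        raise ValueError("counts must be non-negative")
    return erasures + 2 * errors + 1


def can_correct(d_min: int, erasures: int, errors: int) -> bool:
    """True if a code of minimum distance d_min covers this erasure/error mix."""
    return d_min >= min_distance_needed(erasures, errors)


# Assumption (not from the paper): one code symbol per DRAM chip.
# Case 1: a single chip fails at an unknown position (one symbol error).
print(min_distance_needed(erasures=0, errors=1))  # 3
# Case 2: one chip is already marked faulty (erasure) and a second chip
# suffers a random error at an unknown position.
print(min_distance_needed(erasures=1, errors=1))  # 4
# A distance-3 code is therefore not enough for case 2.
print(can_correct(d_min=3, erasures=1, errors=1))  # False
```

Under this simplification, knowing the position of the faulty chip halves its cost in the distance budget, which is why marking a chip as an erasure leaves room to correct a further random error.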

Original language: English (US)
Article number: 7447716
Pages (from-to): 3766-3779
Number of pages: 14
Journal: IEEE Transactions on Computers
Volume: 65
Issue number: 12
DOI: 10.1109/TC.2016.2550455
State: Published - Dec 1 2016

Keywords

  • chipkill-correct
  • DRAM Memory system
  • erasure and error correction
  • error control coding (ECC)
  • reliability

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Using Low Cost Erasure and Error Correction Schemes to Improve Reliability of Commodity DRAM Systems. / Chen, Hsing Min; Jeloka, Supreet; Arunkumar, Akhil; Blaauw, David; Wu, Carole-Jean; Mudge, Trevor; Chakrabarti, Chaitali. In: IEEE Transactions on Computers, Vol. 65, No. 12, 7447716, 01.12.2016, p. 3766-3779.

Research output: Contribution to journal (Article)

@article{80489b75fff14eb4a837b12eae534571,
title = "Using Low Cost Erasure and Error Correction Schemes to Improve Reliability of Commodity DRAM Systems",
abstract = "Most server-grade systems provide Chipkill-Correct error protection at the expense of power and performance. In this paper, we present a low-overhead solution for improving the reliability of commodity DRAM systems with no change to the existing memory architecture. Specifically, we propose five erasure and error correction (E-ECC) schemes that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. Synthesis results at the 28 nm node show that the decoding latency of these codes is negligible compared to the DRAM access latency. In addition, we make use of erasure codes to extend the lifetime of the DRAM systems. Specifically, once a chip is marked faulty due to persistent errors, all E-ECC schemes correct erasures due to that faulty chip and also correct an additional random error in a second chip. Evaluation with SPEC2006 workloads shows that, compared to x4 Chipkill-Correct schemes, Scheme 5 has the highest IPC improvement (mean of 7 percent) and Scheme 4 has the largest power reduction (mean of 18 percent) and the largest increase in energy efficiency (mean of 25 percent).",
keywords = "chipkill-correct, DRAM Memory system, erasure and error correction, error control coding (ECC), reliability",
author = "Chen, {Hsing Min} and Supreet Jeloka and Akhil Arunkumar and David Blaauw and Carole-Jean Wu and Trevor Mudge and Chaitali Chakrabarti",
year = "2016",
month = "12",
day = "1",
doi = "10.1109/TC.2016.2550455",
language = "English (US)",
volume = "65",
pages = "3766--3779",
journal = "IEEE Transactions on Computers",
issn = "0018-9340",
publisher = "IEEE Computer Society",
number = "12",

}

TY - JOUR

T1 - Using Low Cost Erasure and Error Correction Schemes to Improve Reliability of Commodity DRAM Systems

AU - Chen, Hsing Min

AU - Jeloka, Supreet

AU - Arunkumar, Akhil

AU - Blaauw, David

AU - Wu, Carole-Jean

AU - Mudge, Trevor

AU - Chakrabarti, Chaitali

PY - 2016/12/1

Y1 - 2016/12/1

N2 - Most server-grade systems provide Chipkill-Correct error protection at the expense of power and performance. In this paper, we present a low-overhead solution for improving the reliability of commodity DRAM systems with no change to the existing memory architecture. Specifically, we propose five erasure and error correction (E-ECC) schemes that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. Synthesis results at the 28 nm node show that the decoding latency of these codes is negligible compared to the DRAM access latency. In addition, we make use of erasure codes to extend the lifetime of the DRAM systems. Specifically, once a chip is marked faulty due to persistent errors, all E-ECC schemes correct erasures due to that faulty chip and also correct an additional random error in a second chip. Evaluation with SPEC2006 workloads shows that, compared to x4 Chipkill-Correct schemes, Scheme 5 has the highest IPC improvement (mean of 7 percent) and Scheme 4 has the largest power reduction (mean of 18 percent) and the largest increase in energy efficiency (mean of 25 percent).

KW - chipkill-correct

KW - DRAM Memory system

KW - erasure and error correction

KW - error control coding (ECC)

KW - reliability

UR - http://www.scopus.com/inward/record.url?scp=84998679077&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84998679077&partnerID=8YFLogxK

U2 - 10.1109/TC.2016.2550455

DO - 10.1109/TC.2016.2550455

M3 - Article

AN - SCOPUS:84998679077

VL - 65

SP - 3766

EP - 3779

JO - IEEE Transactions on Computers

JF - IEEE Transactions on Computers

SN - 0018-9340

IS - 12

M1 - 7447716

ER -