Configurable-ECC

Architecting a Flexible ECC Scheme to Support Different Sized Accesses in High Bandwidth Memory Systems

Hsing Min Chen, Shin Ying Lee, Trevor Mudge, Carole-Jean Wu, Chaitali Chakrabarti

Research output: Contribution to journal, Article

Abstract

Designing error correction code (ECC) to guarantee strong reliability for high bandwidth memory (HBM) is imperative in high performance computers, especially for systems equipped with graphics processing units (GPUs). The design of ECC is challenging because future GPUs are expected to implement a memory subsystem supporting fine and coarse-grained data accesses to match the difference in the spatial locality of GPGPU applications. Current ECC designs, however, are developed for a fixed data fetch granularity. To have a more flexible design, we propose a novel memory protection scheme, called Config(urable)-ECC, which provides strong reliability for both fine and coarse-grained data accesses. Config-ECC consists of two tiers of ECC protection. The tier-1 code is a strong product code that can correct errors due to small granularity faults and detect errors caused by large granularity faults. The tier-2 code is an XOR-based code that is employed to correct errors incurred by large granularity faults. Config-ECC provides stronger reliability and/or lower energy consumption compared to state-of-the-art fixed 32B and 64B ECC schemes. It reduces the HBM energy by 17%-21% while reducing the failure in time (FIT) rate by 20 times compared to a state-of-the-art fixed 64B ECC scheme with an insignificant 1.2% performance overhead.
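The two-tier idea described in the abstract can be illustrated with a toy sketch. This is not the authors' construction: tier 1 here is a plain two-dimensional parity code (the simplest product code), which fixes a single flipped bit and flags some multi-bit errors, and tier 2 is a byte-wise XOR across a group of blocks. The 4-byte block size, the function names (`tier1_encode`, `tier1_decode`, `xor_blocks`, `read_group`), and the assumption that parity bits are stored fault-free are all hypothetical simplifications made for illustration.

```python
from functools import reduce

ROWS, COLS = 4, 8  # hypothetical: each block is 4 bytes, viewed as a 4x8 bit matrix

def to_bits(block):
    """View a block as ROWS rows of COLS bits (most significant bit first)."""
    return [[(byte >> (COLS - 1 - c)) & 1 for c in range(COLS)] for byte in block]

def from_bits(bits):
    """Inverse of to_bits."""
    return bytes(sum(bit << (COLS - 1 - c) for c, bit in enumerate(row)) for row in bits)

def tier1_encode(block):
    """Toy tier-1 code: one parity bit per row and one per column."""
    bits = to_bits(block)
    row_par = [sum(row) % 2 for row in bits]
    col_par = [sum(row[c] for row in bits) % 2 for c in range(COLS)]
    return row_par, col_par

def tier1_decode(block, parity):
    """Return (status, block) with status 'ok', 'corrected' or 'uncorrectable'.

    Assumes the stored parity itself is fault-free. Some multi-bit patterns
    escape this toy code; a real product code would be much stronger.
    """
    row_par, col_par = parity
    new_row, new_col = tier1_encode(block)
    bad_rows = [i for i in range(ROWS) if row_par[i] != new_row[i]]
    bad_cols = [j for j in range(COLS) if col_par[j] != new_col[j]]
    if not bad_rows and not bad_cols:
        return "ok", block
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        bits = to_bits(block)
        bits[bad_rows[0]][bad_cols[0]] ^= 1  # flip the single located bit back
        return "corrected", from_bits(bits)
    return "uncorrectable", block

def xor_blocks(blocks):
    """Toy tier-2 code: byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def read_group(stored, parities, tier2_parity):
    """Decode every block with tier 1; rebuild at most one flagged block with tier 2."""
    out, failed = [], []
    for i, (block, parity) in enumerate(zip(stored, parities)):
        status, fixed = tier1_decode(block, parity)
        out.append(fixed)
        if status == "uncorrectable":
            failed.append(i)
    if len(failed) > 1:
        raise ValueError("more than one block lost; a single XOR parity cannot recover")
    if failed:
        i = failed[0]
        survivors = [b for j, b in enumerate(out) if j != i]
        out[i] = xor_blocks(survivors + [tier2_parity])
    return out

data = [b"HBM0", b"HBM1", b"HBM2", b"HBM3"]
parities = [tier1_encode(b) for b in data]
tier2 = xor_blocks(data)

stored = list(data)
stored[1] = bytes([data[1][0] ^ 0x10]) + data[1][1:]       # small fault: one flipped bit
stored[2] = xor_blocks([data[2], b"\x01\x02\x04\x08"])     # larger fault: bits flipped in every byte
recovered = read_group(stored, parities, tier2)
assert recovered == data
```

In this sketch the single-bit fault in block 1 is repaired by tier 1 alone, while block 2 is only detected as bad by tier 1 and is then rebuilt from the other blocks and the XOR parity. That mirrors the division of labor the abstract describes, although the actual codes, block sizes, and fault model in the paper may differ.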

Original language: English (US)
Journal: IEEE Transactions on Computers
DOI: 10.1109/TC.2018.2886884
State: Accepted/In press - Jan 1 2018

Fingerprint

Error correction
Bandwidth
Data storage equipment
Granularity
Fault
Graphics Processing Unit
Strong Product
GPGPU
Energy utilization
Locality
Energy Consumption
Subsystem
High Performance

Keywords

  • 3D DRAM
  • Bandwidth
  • Error Control Coding and GPU
  • Error correction codes
  • Graphics processing units
  • Memory Reliability
  • Random access memory
  • Reliability
  • Three-dimensional displays
  • Two dimensional displays

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

@article{2d11a28b2b3942519557bab6eaf20ac5,
title = "Configurable-ECC: Architecting a Flexible ECC Scheme to Support Different Sized Accesses in High Bandwidth Memory Systems",
abstract = "Designing error correction code (ECC) to guarantee strong reliability for high bandwidth memory (HBM) is imperative in high performance computers, especially for systems equipped with graphics processing units (GPUs). The design of ECC is challenging because future GPUs are expected to implement a memory subsystem supporting fine and coarse-grained data accesses to match the difference in the spatial locality of GPGPU applications. Current ECC designs, however, are developed for a fixed data fetch granularity. To have a more flexible design, we propose a novel memory protection scheme, called Config(urable)-ECC, which provides strong reliability for both fine and coarse-grained data accesses. Config-ECC consists of two tiers of ECC protection. The tier-1 code is a strong product code that can correct errors due to small granularity faults and detect errors caused by large granularity faults. The tier-2 code is an XOR-based code that is employed to correct errors incurred by large granularity faults. Config-ECC provides stronger reliability and/or lower energy consumption compared to state-of-the-art fixed 32B and 64B ECC schemes. It reduces the HBM energy by 17{\%}-21{\%} while reducing the failure in time (FIT) rate by 20 times compared to a state-of-the-art fixed 64B ECC scheme with an insignificant 1.2{\%} performance overhead.",
keywords = "3D DRAM, Bandwidth, Error Control Coding and GPU, Error correction codes, Graphics processing units, Memory Reliability, Random access memory, Reliability, Three-dimensional displays, Two dimensional displays",
author = "Chen, {Hsing Min} and Lee, {Shin Ying} and Trevor Mudge and Carole-Jean Wu and Chaitali Chakrabarti",
year = "2018",
month = "1",
day = "1",
doi = "10.1109/TC.2018.2886884",
language = "English (US)",
journal = "IEEE Transactions on Computers",
issn = "0018-9340",
publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - Configurable-ECC

T2 - Architecting a Flexible ECC Scheme to Support Different Sized Accesses in High Bandwidth Memory Systems

AU - Chen, Hsing Min

AU - Lee, Shin Ying

AU - Mudge, Trevor

AU - Wu, Carole-Jean

AU - Chakrabarti, Chaitali

PY - 2018/1/1

Y1 - 2018/1/1

AB - Designing error correction code (ECC) to guarantee strong reliability for high bandwidth memory (HBM) is imperative in high performance computers, especially for systems equipped with graphics processing units (GPUs). The design of ECC is challenging because future GPUs are expected to implement a memory subsystem supporting fine and coarse-grained data accesses to match the difference in the spatial locality of GPGPU applications. Current ECC designs, however, are developed for a fixed data fetch granularity. To have a more flexible design, we propose a novel memory protection scheme, called Config(urable)-ECC, which provides strong reliability for both fine and coarse-grained data accesses. Config-ECC consists of two tiers of ECC protection. The tier-1 code is a strong product code that can correct errors due to small granularity faults and detect errors caused by large granularity faults. The tier-2 code is an XOR-based code that is employed to correct errors incurred by large granularity faults. Config-ECC provides stronger reliability and/or lower energy consumption compared to state-of-the-art fixed 32B and 64B ECC schemes. It reduces the HBM energy by 17%-21% while reducing the failure in time (FIT) rate by 20 times compared to a state-of-the-art fixed 64B ECC scheme with an insignificant 1.2% performance overhead.

KW - 3D DRAM

KW - Bandwidth

KW - Error Control Coding and GPU

KW - Error correction codes

KW - Graphics processing units

KW - Memory Reliability

KW - Random access memory

KW - Reliability

KW - Three-dimensional displays

KW - Two dimensional displays

UR - http://www.scopus.com/inward/record.url?scp=85058879038&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058879038&partnerID=8YFLogxK

U2 - 10.1109/TC.2018.2886884

DO - 10.1109/TC.2018.2886884

M3 - Article

JO - IEEE Transactions on Computers

JF - IEEE Transactions on Computers

SN - 0018-9340

ER -