TY - GEN
T1 - MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability
T2 - 44th Annual International Symposium on Computer Architecture - ISCA 2017
AU - Arunkumar, Akhil
AU - Bolotin, Evgeny
AU - Cho, Benjamin
AU - Milic, Ugljesa
AU - Ebrahimi, Eiman
AU - Villa, Oreste
AU - Jaleel, Aamer
AU - Wu, Carole-Jean
AU - Nellans, David
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/6/24
Y1 - 2017/6/24
N2 - Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.
AB - Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.
KW - Graphics Processing Units
KW - Moore's Law
KW - Multi-Chip-Modules
KW - NUMA Systems
UR - http://www.scopus.com/inward/record.url?scp=85025581236&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85025581236&partnerID=8YFLogxK
U2 - 10.1145/3079856.3080231
DO - 10.1145/3079856.3080231
M3 - Conference contribution
AN - SCOPUS:85025581236
T3 - Proceedings - International Symposium on Computer Architecture
SP - 320
EP - 332
BT - ISCA 2017 - 44th Annual International Symposium on Computer Architecture - Conference Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 June 2017 through 28 June 2017
ER -