TY - GEN
T1 - MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability
T2 - 44th Annual International Symposium on Computer Architecture - ISCA 2017
AU - Arunkumar, Akhil
AU - Bolotin, Evgeny
AU - Cho, Benjamin
AU - Milic, Ugljesa
AU - Ebrahimi, Eiman
AU - Villa, Oreste
AU - Jaleel, Aamer
AU - Wu, Carole-Jean
AU - Nellans, David
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/6/24
Y1 - 2017/6/24
N2 - Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.
AB - Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.
KW - Graphics Processing Units
KW - Moore's Law
KW - Multi-Chip-Modules
KW - NUMA Systems
UR - http://www.scopus.com/inward/record.url?scp=85025581236&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85025581236&partnerID=8YFLogxK
U2 - 10.1145/3079856.3080231
DO - 10.1145/3079856.3080231
M3 - Conference contribution
AN - SCOPUS:85025581236
T3 - Proceedings - International Symposium on Computer Architecture
SP - 320
EP - 332
BT - ISCA 2017 - 44th Annual International Symposium on Computer Architecture - Conference Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 June 2017 through 28 June 2017
ER -