TY - JOUR
T1 - A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator
AU - Park, Dong Hyeon
AU - Pal, Subhankar
AU - Feng, Siying
AU - Gao, Paul
AU - Tan, Jielun
AU - Rovinski, Austin
AU - Xie, Shaolin
AU - Zhao, Chun
AU - Amarnath, Aporva
AU - Wesley, Timothy
AU - Beaumont, Jonathan
AU - Chen, Kuan Yu
AU - Chakrabarti, Chaitali
AU - Taylor, Michael Bedford
AU - Mudge, Trevor
AU - Blaauw, David
AU - Kim, Hun Seok
AU - Dreslinski, Ronald G.
N1 - Funding Information:
Manuscript received August 26, 2019; revised November 3, 2019; accepted December 5, 2019. Date of publication January 1, 2020; date of current version March 26, 2020. This article was approved by Guest Editor Ken Takeuchi. This work was supported in part by the Air Force Research Laboratory (AFRL) and in part by the Defense Advanced Research Projects Agency (DARPA) under Grant FA8650-18-2-7864. (Corresponding author: Dong-Hyeon Park.) D.-H. Park, S. Pal, S. Feng, J. Tan, A. Rovinski, A. Amarnath, J. Beaumont, T. Mudge, and R. G. Dreslinski are with the Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: dohypark@umich.edu).
Publisher Copyright:
© 1966-2012 IEEE.
PY - 2020/4
Y1 - 2020/4
N2 - A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units, and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits 12.6× (8.4×) energy efficiency gain, 11.7× (77.6×) off-chip bandwidth efficiency gain, and 17.1× (36.9×) compute density gains against a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
AB - A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units, and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits 12.6× (8.4×) energy efficiency gain, 11.7× (77.6×) off-chip bandwidth efficiency gain, and 17.1× (36.9×) compute density gains against a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
KW - Decoupled access execution
KW - reconfigurability and accelerator
KW - sparse matrix multiplier
KW - synthesizable crossbar
UR - http://www.scopus.com/inward/record.url?scp=85082859300&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082859300&partnerID=8YFLogxK
U2 - 10.1109/JSSC.2019.2960480
DO - 10.1109/JSSC.2019.2960480
M3 - Article
AN - SCOPUS:85082859300
SN - 0018-9200
VL - 55
SP - 933
EP - 944
JO - IEEE Journal of Solid-State Circuits
JF - IEEE Journal of Solid-State Circuits
IS - 4
M1 - 8947989
ER -