A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as either scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits a 12.6× (8.4×) energy efficiency gain, an 11.7× (77.6×) off-chip bandwidth efficiency gain, and a 17.1× (36.9×) compute density gain against a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
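For readers unfamiliar with SpMM, the following is a minimal software sketch of why an algorithm with distinct phases can favor different memory behavior in each. The abstract does not name the algorithm or its phases, so the two-phase split (multiply, then merge), the dictionary-based storage, and the function name `spmm_two_phase` are all assumptions made for illustration and do not describe the chip's actual dataflow.

```python
from collections import defaultdict


def spmm_two_phase(a, b):
    """Multiply two sparse matrices stored as {(row, col): value} dicts.

    Illustrative only: the multiply/merge split is an assumed phase
    structure, not the accelerator's documented algorithm.
    """
    # Index B by row so each nonzero A[i, k] can find the nonzeros of B[k, :].
    b_rows = defaultdict(list)
    for (k, j), b_val in b.items():
        b_rows[k].append((j, b_val))

    # Phase 1 (multiply): emit one partial product per matching nonzero pair.
    # Inputs are read once and outputs are appended, so access is streaming.
    partials = []
    for (i, k), a_val in a.items():
        for j, b_val in b_rows.get(k, ()):
            partials.append((i, j, a_val * b_val))

    # Phase 2 (merge): sum partial products that share an output coordinate.
    # The same output entries are revisited, so access is irregular with reuse.
    c = defaultdict(float)
    for i, j, p in partials:
        c[(i, j)] += p

    # Drop entries that cancelled to exactly zero to keep the result sparse.
    return {ij: v for ij, v in c.items() if v != 0.0}
```

Calling `spmm_two_phase(a, b)` with two small dictionaries of nonzeros returns the nonzeros of the product in the same format. In this sketch the first phase streams through its data while the second repeatedly updates a shared set of outputs, which is one plausible reason a design might switch its on-chip memory between scratchpad and cache operation.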
- Decoupled access execution
- reconfigurability and accelerator
- sparse matrix multiplier
- synthesizable crossbar
ASJC Scopus subject areas
- Electrical and Electronic Engineering