OuterSPACE

An Outer Product Based Sparse Matrix Multiplication Accelerator

Subhankar Pal, Jonathan Beaumont, Dong Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun Seok Kim, David Blaauw, Trevor Mudge, Ronald Dreslinski

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².
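
The outer product formulation described in the abstract can be illustrated with a small software sketch. This is not the paper's hardware design or its data layout; it only shows the idea of decoupling multiplication from accumulation. The function name `outer_product_spgemm` and the dictionary-based storage (`a_cols`, `b_rows`) are assumptions made for this example.

```python
from collections import defaultdict


def outer_product_spgemm(a_cols, b_rows):
    """Compute C = A @ B for sparse A and B using outer products.

    Hypothetical storage, chosen for illustration only:
      a_cols[k] = list of (i, value) non-zeros in column k of A
      b_rows[k] = list of (j, value) non-zeros in row k of B
    Returns C as {i: {j: value}}.
    """
    # Multiply phase: column k of A is paired with row k of B exactly once,
    # so each non-zero of A and B is fetched a single time. The products
    # are stored as partial results and are not summed yet.
    partial = defaultdict(list)  # row index i -> list of (j, product)
    for k, col in a_cols.items():
        row = b_rows.get(k, [])
        for i, a_val in col:
            for j, b_val in row:
                partial[i].append((j, a_val * b_val))

    # Merge phase: accumulate partial products that share the same (i, j).
    c = {}
    for i, products in partial.items():
        acc = defaultdict(float)
        for j, p in products:
            acc[j] += p
        c[i] = dict(acc)
    return c


# A = [[1, 0], [2, 3]] stored by column; B = [[0, 4], [5, 0]] stored by row.
a_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
c = outer_product_spgemm(a_cols, b_rows)
```

In this sketch the multiply phase touches every input non-zero once, while the merge phase works only on the generated partial products, which is the separation the abstract refers to.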

Original language: English (US)
Title of host publication: Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018
Publisher: IEEE Computer Society
Pages: 724-736
Number of pages: 13
Volume: 2018-February
ISBN (Electronic): 9781538636596
DOI: https://doi.org/10.1109/HPCA.2018.00067
State: Published - Mar 27 2018
Event: 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018 - Vienna, Austria
Duration: Feb 24 2018 to Feb 28 2018

Other

Other: 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018
Country: Austria
City: Vienna
Period: 2/24/18 to 2/28/18

Keywords

  • Application-specific hardware
  • Hardware accelerators
  • Hardware-software co-design
  • Parallel computer architecture
  • Sparse matrix processing

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Pal, S., Beaumont, J., Park, D. H., Amarnath, A., Feng, S., Chakrabarti, C., ... Dreslinski, R. (2018). OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. In Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018 (Vol. 2018-February, pp. 724-736). IEEE Computer Society. https://doi.org/10.1109/HPCA.2018.00067

@inproceedings{0ec2295a0b4646a8954fb4916d4e4188,
title = "OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator",
keywords = "Application-specific hardware, Hardware accelerators, Hardware-software co-design, Parallel computer architecture, Sparse matrix processing",
author = "Subhankar Pal and Jonathan Beaumont and Park, {Dong Hyeon} and Aporva Amarnath and Siying Feng and Chaitali Chakrabarti and Kim, {Hun Seok} and David Blaauw and Trevor Mudge and Ronald Dreslinski",
year = "2018",
month = "3",
day = "27",
doi = "10.1109/HPCA.2018.00067",
language = "English (US)",
volume = "2018-February",
pages = "724--736",
booktitle = "Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018",
publisher = "IEEE Computer Society",

}
