OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator

Subhankar Pal; Jonathan Beaumont; Dong Hyeon Park; Aporva Amarnath; Siying Feng; Chaitali Chakrabarti; Hun Seok Kim; David Blaauw; Trevor Mudge; Ronald Dreslinski

doi:10.1109/HPCA.2018.00067

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

162 Scopus citations

Abstract

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².
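
As a rough illustration of the outer product idea mentioned in the abstract (multiplication decoupled from accumulation), the following is a minimal software sketch, not the accelerator's actual implementation. It assumes A is stored by column and B by row as plain Python dictionaries; the function name `outer_product_spgemm` and the parameters `a_cols` and `b_rows` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def outer_product_spgemm(a_cols, b_rows):
    """Multiply sparse A by sparse B using the outer product formulation.

    a_cols[k] lists the (row index, value) non-zeros of column k of A.
    b_rows[k] lists the (column index, value) non-zeros of row k of B.
    This storage layout is an assumption made for the sketch.
    """
    # Multiply phase: pair column k of A with row k of B. Each pair is
    # visited once and produces the non-zeros of one partial-product matrix.
    partials = []
    for k in a_cols.keys() & b_rows.keys():
        for i, a in a_cols[k]:
            for j, b in b_rows[k]:
                partials.append((i, j, a * b))

    # Merge phase: accumulate partial products that share a coordinate.
    # No input non-zero is touched again here; only partials are read.
    c = defaultdict(float)
    for i, j, v in partials:
        c[(i, j)] += v
    return dict(c)


# A = [[1, 0], [2, 3]] stored by column, B = [[0, 4], [5, 0]] stored by row.
a_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
result = outer_product_spgemm(a_cols, b_rows)
```

For these inputs the result holds the non-zeros of A·B = [[0, 4], [15, 8]], keyed by (row, column). The two loops mirror the separation of the multiply and accumulate steps; a dictionary is used for the merge purely for brevity.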

Original language	English (US)
Title of host publication	Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018
Publisher	IEEE Computer Society
Pages	724-736
Number of pages	13
ISBN (Electronic)	9781538636596
DOIs	https://doi.org/10.1109/HPCA.2018.00067
State	Published - Mar 27 2018
Event	24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018 - Vienna, Austria; Duration: Feb 24 2018 → Feb 28 2018

Publication series

Name	Proceedings - International Symposium on High-Performance Computer Architecture
Volume	2018-February
ISSN (Print)	1530-0897

Other

Other	24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018
Country/Territory	Austria
City	Vienna
Period	2/24/18 → 2/28/18

Keywords

Application specific hardware
Hardware accelerators
Hardware software co design
Parallel computer architecture
Sparse matrix processing

ASJC Scopus subject areas

Hardware and Architecture

Cite this

Pal, S., Beaumont, J., Park, D. H., Amarnath, A., Feng, S., Chakrabarti, C., Kim, H. S., Blaauw, D., Mudge, T., & Dreslinski, R. (2018). OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. In Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018 (pp. 724-736). (Proceedings - International Symposium on High-Performance Computer Architecture; Vol. 2018-February). IEEE Computer Society. https://doi.org/10.1109/HPCA.2018.00067

OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. / Pal, Subhankar; Beaumont, Jonathan; Park, Dong Hyeon et al.
Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018. IEEE Computer Society, 2018. p. 724-736 (Proceedings - International Symposium on High-Performance Computer Architecture; Vol. 2018-February).

Pal, S, Beaumont, J, Park, DH, Amarnath, A, Feng, S, Chakrabarti, C, Kim, HS, Blaauw, D, Mudge, T & Dreslinski, R 2018, OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. in Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018. Proceedings - International Symposium on High-Performance Computer Architecture, vol. 2018-February, IEEE Computer Society, pp. 724-736, 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Vienna, Austria, 2/24/18. https://doi.org/10.1109/HPCA.2018.00067

Pal S, Beaumont J, Park DH, Amarnath A, Feng S, Chakrabarti C et al. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. In Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018. IEEE Computer Society. 2018. p. 724-736. (Proceedings - International Symposium on High-Performance Computer Architecture). doi: 10.1109/HPCA.2018.00067

@inproceedings{0ec2295a0b4646a8954fb4916d4e4188,

title = "OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator",

abstract = "Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².",

keywords = "Application specific hardware, Hardware accelerators, Hardware software co design, Parallel computer architecture, Sparse matrix processing",

author = "Subhankar Pal and Jonathan Beaumont and Park, {Dong Hyeon} and Aporva Amarnath and Siying Feng and Chaitali Chakrabarti and Kim, {Hun Seok} and David Blaauw and Trevor Mudge and Ronald Dreslinski",

note = "Publisher Copyright: {\textcopyright} 2018 IEEE.; 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018 ; Conference date: 24-02-2018 Through 28-02-2018",

year = "2018",

month = mar,

day = "27",

doi = "10.1109/HPCA.2018.00067",

language = "English (US)",

series = "Proceedings - International Symposium on High-Performance Computer Architecture",

publisher = "IEEE Computer Society",

pages = "724--736",

booktitle = "Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018",

}

TY - GEN

T1 - OuterSPACE

T2 - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018

AU - Pal, Subhankar

AU - Beaumont, Jonathan

AU - Park, Dong Hyeon

AU - Amarnath, Aporva

AU - Feng, Siying

AU - Chakrabarti, Chaitali

AU - Kim, Hun Seok

AU - Blaauw, David

AU - Mudge, Trevor

AU - Dreslinski, Ronald

PY - 2018/3/27

Y1 - 2018/3/27

N2 - Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².

KW - Application specific hardware

KW - Hardware accelerators

KW - Hardware software co design

KW - Parallel computer architecture

KW - Sparse matrix processing

UR - http://www.scopus.com/inward/record.url?scp=85046825176&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85046825176&partnerID=8YFLogxK

U2 - 10.1109/HPCA.2018.00067

DO - 10.1109/HPCA.2018.00067

M3 - Conference contribution

AN - SCOPUS:85046825176

T3 - Proceedings - International Symposium on High-Performance Computer Architecture

SP - 724

EP - 736

BT - Proceedings - 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018

PB - IEEE Computer Society

Y2 - 24 February 2018 through 28 February 2018

ER -
