TY - JOUR
T1 - DMazerunner
T2 - Executing perfectly nested loops on dataflow accelerators
AU - Dave, Shail
AU - Kim, Youngbin
AU - Avancha, Sasikanth
AU - Lee, Kyoungwoo
AU - Shrivastava, Aviral
N1 - Funding Information:
We thank the anonymous reviewers for their valuable feedback and suggestions and Mr. Sagar Parekh at the Compiler Microarchitecture Lab, ASU, for assisting in the automation of some evaluations. This research was partially supported by funding from the National Science Foundation under grant CCF 1723476 - NSF/Intel joint research center for Computer Assisted Programming for Heterogeneous Architectures (CAPA), and from the grants NRF-2015M3C4A7065522 (Next-generation Information Computing Development Program, funded by the National Research Foundation of Korea, MSIT) and 2014-3-00035 (Research on High Performance and Scalable Manycore Operating System, funded by IITP, MSIT). Any opinions, findings, and conclusions presented in this material are those of the authors and do not necessarily reflect the views of their employers or the sponsoring agencies.
Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/10
Y1 - 2019/10
N2 - Dataflow accelerators feature simplicity, programmability, and energy efficiency and are envisioned as a promising architecture for accelerating the perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs have been proposed, discovering the most efficient way to execute an application's perfectly nested loop on the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential yet unsolved challenge. In this paper, we propose dMazeRunner to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of the dMazeRunner framework lies in: i) a holistic representation of loop nests that succinctly captures the various execution methods; ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods; and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time than prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP compared to the optimal solution.
KW - Analytical model
KW - Coarse-grained reconfigurable array
KW - Dataflow
KW - Deep neural networks
KW - Design space exploration
KW - Energy-efficiency
KW - Loop optimization
KW - Mapping
KW - Systolic arrays
UR - http://www.scopus.com/inward/record.url?scp=85073149210&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073149210&partnerID=8YFLogxK
U2 - 10.1145/3358198
DO - 10.1145/3358198
M3 - Article
AN - SCOPUS:85073149210
SN - 1539-9087
VL - 18
JO - ACM Transactions on Embedded Computing Systems
JF - ACM Transactions on Embedded Computing Systems
IS - 5s
M1 - a70
ER -