TY - GEN
T1 - RAMP
T2 - 55th Annual Design Automation Conference, DAC 2018
AU - Dave, Shail
AU - Balasubramanian, Mahesh
AU - Shrivastava, Aviral
N1 - Funding Information:
However, the computation time of RAMP is comparable to that of REGIMap and MEMMap (on the order of seconds), if not always better. Essentially, this stems from higher mapping quality (fewer iterations due to 2× better II) and far fewer nodes to be mapped (i.e., smaller n) in any of the attempts. For example, both REGIMap and MEMMap load the live-in data from memory [10, 13, 18]. In addition, REGIMap cannot spill data and requires many routing operations when constrained by the availability of only a few local registers. Similarly, MEMMap often routes data via memory, even if enough registers are available. Thus, they have to map 1.5× to 2× as many nodes as RAMP. 7 SUMMARY This paper presents the challenges with existing mapping techniques, which are unable to make good use of the routing resources. They first schedule the DDG and then attempt P&R; routing is internal to P&R and is carried out in an ad hoc manner. As a result, operations may fail to map due to resource constraints. This paper introduces RAMP, which models various routing strategies explicitly and flexibly explores various ways to map the data dependencies while exploiting the CGRA resources. RAMP accelerates the top performance-critical loops of MiBench by 23× over sequential execution and by 2.13× over state-of-the-art techniques. ACKNOWLEDGMENTS This work was partially supported by funding from NSF grants CNS 1525855 and CCF 172346 - NSF/Intel joint research center for Computer Assisted Programming for Heterogeneous Architectures (CAPA).
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/6/24
Y1 - 2018/6/24
N2 - A coarse-grained reconfigurable array (CGRA) is a promising solution that can accelerate even non-parallel loops. The acceleration achieved through CGRAs critically depends on the quality of the mapping (of loop operations onto the PEs of the CGRA) and, in particular, on the compiler's ability to route the dependencies among operations. Previous works have explored several mechanisms to route data dependencies, including routing through other PEs, registers, memory, and even re-computation. All these routing options change the graph to be mapped onto PEs (often by adding new operations), and without re-scheduling, it may be impossible to map the new graph. However, existing techniques explore these routing options inside the Place and Route (P&R) phase of the compilation process, which is performed after the scheduling step. As a result, they may either fail to achieve a mapping or obtain poor results. Our method, RAMP, explicitly and intelligently explores the various routing options before the scheduling step, and improves both the mapping ability and the mapping quality. Evaluating the top performance-critical loops of MiBench benchmarks over 12 architectural configurations, we find that RAMP is able to accelerate loops by 23× over sequential execution, achieving a geomean speedup of 2.13× over the state of the art.
AB - A coarse-grained reconfigurable array (CGRA) is a promising solution that can accelerate even non-parallel loops. The acceleration achieved through CGRAs critically depends on the quality of the mapping (of loop operations onto the PEs of the CGRA) and, in particular, on the compiler's ability to route the dependencies among operations. Previous works have explored several mechanisms to route data dependencies, including routing through other PEs, registers, memory, and even re-computation. All these routing options change the graph to be mapped onto PEs (often by adding new operations), and without re-scheduling, it may be impossible to map the new graph. However, existing techniques explore these routing options inside the Place and Route (P&R) phase of the compilation process, which is performed after the scheduling step. As a result, they may either fail to achieve a mapping or obtain poor results. Our method, RAMP, explicitly and intelligently explores the various routing options before the scheduling step, and improves both the mapping ability and the mapping quality. Evaluating the top performance-critical loops of MiBench benchmarks over 12 architectural configurations, we find that RAMP is able to accelerate loops by 23× over sequential execution, achieving a geomean speedup of 2.13× over the state of the art.
UR - http://www.scopus.com/inward/record.url?scp=85053670482&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85053670482&partnerID=8YFLogxK
U2 - 10.1145/3195970.3196101
DO - 10.1145/3195970.3196101
M3 - Conference contribution
AN - SCOPUS:85053670482
SN - 9781450357005
T3 - Proceedings - Design Automation Conference
BT - Proceedings of the 55th Annual Design Automation Conference, DAC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 June 2018 through 29 June 2018
ER -