TY - GEN
T1 - Accelerating Linear Algebra Kernels on a Massively Parallel Reconfigurable Architecture
AU - Soorishetty, A.
AU - Zhou, J.
AU - Pal, S.
AU - Blaauw, D.
AU - Kim, H.
AU - Mudge, T.
AU - Dreslinski, R.
AU - Chakrabarti, C.
N1 - Funding Information:
Acknowledgment: The material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7864. The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements, either expressed or implied, of AFRL and DARPA or the U.S. Government.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - Much of the recent work on domain-specific architectures has focused on bridging the gap between performance/efficiency and programmability. We consider one such example architecture, Transformer, consisting of lightweight cores interconnected by caches and crossbars, which supports run-time reconfiguration between shared and private cache modes of operation. We present customized implementations of a select set of linear algebra kernels, namely, triangular matrix solver, LU decomposition, QR decomposition and matrix inversion, on Transformer. The performance of the kernel algorithms is evaluated with respect to execution time and energy efficiency. Our study shows that each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes, making a case for run-time reconfiguration.
AB - Much of the recent work on domain-specific architectures has focused on bridging the gap between performance/efficiency and programmability. We consider one such example architecture, Transformer, consisting of lightweight cores interconnected by caches and crossbars, which supports run-time reconfiguration between shared and private cache modes of operation. We present customized implementations of a select set of linear algebra kernels, namely, triangular matrix solver, LU decomposition, QR decomposition and matrix inversion, on Transformer. The performance of the kernel algorithms is evaluated with respect to execution time and energy efficiency. Our study shows that each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes, making a case for run-time reconfiguration.
UR - http://www.scopus.com/inward/record.url?scp=85089238825&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089238825&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9054126
DO - 10.1109/ICASSP40776.2020.9054126
M3 - Conference contribution
AN - SCOPUS:85089238825
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 1558
EP - 1562
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -