TY - GEN
T1 - Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture
AU - Peng, Xiaochen
AU - Liu, Rui
AU - Yu, Shimeng
N1 - Funding Information:
This work is supported by ASCENT, one of the SRC/DARPA JUMP centers, NSF-CCF-1903951, NSF-CCF-1740225, SRC Contract 2018-NC-2762 and Samsung.
Publisher Copyright:
© 2019 IEEE
PY - 2019
Y1 - 2019
N2 - Resistive random access memory (RRAM) based array architecture has been proposed for on-chip acceleration of convolutional neural networks (CNNs), where the array can be configured for dot-product computation in a parallel fashion by summing up the column currents. Prior processing-in-memory (PIM) designs unroll each 3D kernel of the convolutional layers into a vertical column of a large weight matrix, where the input data is accessed multiple times. As a result, significant latency and energy are consumed in the interconnect and buffer. In this paper, in order to maximize both weight and input data reuse for the RRAM-based PIM architecture, we propose a novel weight mapping method and the corresponding data flow, which divides the kernels and assigns the input data to different processing elements (PEs) according to their spatial locations. The proposed design achieves ~65% savings in latency and energy for the interconnect and buffer, and yields an overall 2.1× speedup and ~17% improvement in energy efficiency in terms of TOPS/W for the VGG-16 CNN, compared with the prior design based on the conventional mapping method.
AB - Resistive random access memory (RRAM) based array architecture has been proposed for on-chip acceleration of convolutional neural networks (CNNs), where the array can be configured for dot-product computation in a parallel fashion by summing up the column currents. Prior processing-in-memory (PIM) designs unroll each 3D kernel of the convolutional layers into a vertical column of a large weight matrix, where the input data is accessed multiple times. As a result, significant latency and energy are consumed in the interconnect and buffer. In this paper, in order to maximize both weight and input data reuse for the RRAM-based PIM architecture, we propose a novel weight mapping method and the corresponding data flow, which divides the kernels and assigns the input data to different processing elements (PEs) according to their spatial locations. The proposed design achieves ~65% savings in latency and energy for the interconnect and buffer, and yields an overall 2.1× speedup and ~17% improvement in energy efficiency in terms of TOPS/W for the VGG-16 CNN, compared with the prior design based on the conventional mapping method.
KW - Deep neural network
KW - Hardware accelerator
KW - Machine learning
KW - Non-volatile memory
KW - Processing-in-memory
UR - http://www.scopus.com/inward/record.url?scp=85066804151&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066804151&partnerID=8YFLogxK
U2 - 10.1109/ISCAS.2019.8702715
DO - 10.1109/ISCAS.2019.8702715
M3 - Conference contribution
AN - SCOPUS:85066804151
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - 2019 IEEE International Symposium on Circuits and Systems, ISCAS 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE International Symposium on Circuits and Systems, ISCAS 2019
Y2 - 26 May 2019 through 29 May 2019
ER -