TY - GEN
T1 - Automatic compiler based FPGA accelerator for CNN training
AU - Venkataramanaiah, Shreyas Kolala
AU - Ma, Yufei
AU - Yin, Shihui
AU - Nurvitadhi, Eriko
AU - Dasu, Aravind
AU - Cao, Yu
AU - Seo, Jae-sun
N1 - Funding Information:
The authors would like to thank Intel Corporation for supporting and funding this research work. This work was also partially supported by NSF grant 1652866 and C-BRIC, one of six centers in JUMP, a SRC program sponsored by DARPA.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
AB - Training of convolutional neural networks (CNNs) on embedded platforms to support on-device learning is earning vital importance in recent days. Designing flexible training hardware is much more challenging than inference hardware, due to design complexity and large computation/memory requirement. In this work, we present an automatic compiler based FPGA accelerator with 16-bit fixed-point precision for complete CNN training, including Forward Pass (FP), Backward Pass (BP) and Weight Update (WU). We implemented an optimized RTL library to perform training-specific tasks and developed an RTL compiler to automatically generate FPGA-synthesizable RTL based on user-defined constraints. We present a new cyclic weight storage/access scheme for on-chip BRAM and off-chip DRAM to efficiently implement non-transpose and transpose operations during FP and BP phases, respectively. Representative CNNs for CIFAR-10 dataset are implemented and trained on Intel Stratix 10 GX FPGA using proposed hardware architecture, demonstrating up to 479 GOPS performance.
KW - Back-propagation
KW - Convolutional neural networks
KW - FPGA
KW - Hardware accelerator
KW - Neural network training
UR - http://www.scopus.com/inward/record.url?scp=85075638514&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075638514&partnerID=8YFLogxK
U2 - 10.1109/FPL.2019.00034
DO - 10.1109/FPL.2019.00034
M3 - Conference contribution
AN - SCOPUS:85075638514
T3 - Proceedings - 29th International Conference on Field-Programmable Logic and Applications, FPL 2019
SP - 166
EP - 172
BT - Proceedings - 29th International Conference on Field-Programmable Logic and Applications, FPL 2019
A2 - Sourdis, Ioannis
A2 - Bouganis, Christos-Savvas
A2 - Alvarez, Carlos
A2 - Toledo Diaz, Leonel Antonio
A2 - Valero, Pedro
A2 - Martorell, Xavier
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th International Conference on Field-Programmable Logic and Applications, FPL 2019
Y2 - 9 September 2019 through 13 September 2019
ER -