TY - GEN
T1 - Automatic compiler based FPGA accelerator for CNN training
AU - Venkataramanaiah, Shreyas Kolala
AU - Ma, Yufei
AU - Yin, Shihui
AU - Nurvitadhi, Eriko
AU - Dasu, Aravind
AU - Cao, Yu
AU - Seo, Jae-sun
N1 - Funding Information:
The authors would like to thank Intel Corporation for supporting and funding this research work. This work was also partially supported by NSF grant 1652866 and C-BRIC, one of six centers in JUMP, a SRC program sponsored by DARPA.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
AB - Training of convolutional neural networks (CNNs) on embedded platforms to support on-device learning is earning vital importance in recent days. Designing flexible training hardware is much more challenging than inference hardware, due to design complexity and large computation/memory requirement. In this work, we present an automatic compiler based FPGA accelerator with 16-bit fixed-point precision for complete CNN training, including Forward Pass (FP), Backward Pass (BP) and Weight Update (WU). We implemented an optimized RTL library to perform training-specific tasks and developed an RTL compiler to automatically generate FPGA-synthesizable RTL based on user-defined constraints. We present a new cyclic weight storage/access scheme for on-chip BRAM and off-chip DRAM to efficiently implement non-transpose and transpose operations during FP and BP phases, respectively. Representative CNNs for CIFAR-10 dataset are implemented and trained on Intel Stratix 10 GX FPGA using proposed hardware architecture, demonstrating up to 479 GOPS performance.
KW - Back-propagation
KW - Convolutional neural networks
KW - FPGA
KW - Hardware accelerator
KW - Neural network training
UR - http://www.scopus.com/inward/record.url?scp=85075638514&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075638514&partnerID=8YFLogxK
U2 - 10.1109/FPL.2019.00034
DO - 10.1109/FPL.2019.00034
M3 - Conference contribution
AN - SCOPUS:85075638514
T3 - Proceedings - 29th International Conference on Field-Programmable Logic and Applications, FPL 2019
SP - 166
EP - 172
BT - Proceedings - 29th International Conference on Field-Programmable Logic and Applications, FPL 2019
A2 - Sourdis, Ioannis
A2 - Bouganis, Christos-Savvas
A2 - Alvarez, Carlos
A2 - Toledo Diaz, Leonel Antonio
A2 - Valero, Pedro
A2 - Martorell, Xavier
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th International Conference on Field-Programmable Logic and Applications, FPL 2019
Y2 - 9 September 2019 through 13 September 2019
ER -