TY - GEN
T1 - End-to-end scalable FPGA accelerator for deep residual networks
AU - Ma, Yufei
AU - Kim, Minkyu
AU - Cao, Yu
AU - Vrudhula, Sarma
AU - Seo, Jae-sun
N1 - Funding Information:
This work was supported in part by the NSF I/UCRC Center for Embedded Systems through NSF grants 1361926 and 1535669, and by the Samsung Advanced Institute of Technology.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/25
Y1 - 2017/9/25
N2 - This work presents an efficient hardware accelerator design for deep residual learning algorithms, which have shown superior image recognition accuracy (>90% top-5 accuracy on the ImageNet database). Two key objectives of the acceleration strategy are to (1) maximize resource utilization and minimize data movement, and (2) employ scalable and reusable computing primitives to optimize the physical design under hardware constraints. Furthermore, we present techniques for efficient integration and communication of these primitives in deep residual convolutional neural networks (CNNs), which exhibit complex, non-uniform layer connections. The proposed hardware accelerator efficiently implements the state-of-the-art ResNet-50/152 algorithms on an Arria-10 FPGA, demonstrating 285.1/315.5 GOPS of throughput and 27.2/71.7 ms of latency, respectively.
AB - This work presents an efficient hardware accelerator design for deep residual learning algorithms, which have shown superior image recognition accuracy (>90% top-5 accuracy on the ImageNet database). Two key objectives of the acceleration strategy are to (1) maximize resource utilization and minimize data movement, and (2) employ scalable and reusable computing primitives to optimize the physical design under hardware constraints. Furthermore, we present techniques for efficient integration and communication of these primitives in deep residual convolutional neural networks (CNNs), which exhibit complex, non-uniform layer connections. The proposed hardware accelerator efficiently implements the state-of-the-art ResNet-50/152 algorithms on an Arria-10 FPGA, demonstrating 285.1/315.5 GOPS of throughput and 27.2/71.7 ms of latency, respectively.
KW - Convolutional neural networks
KW - Deep learning
KW - Deep residual networks
KW - FPGA
KW - Hardware acceleration
UR - http://www.scopus.com/inward/record.url?scp=85032694855&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032694855&partnerID=8YFLogxK
U2 - 10.1109/ISCAS.2017.8050344
DO - 10.1109/ISCAS.2017.8050344
M3 - Conference contribution
AN - SCOPUS:85032694855
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - IEEE International Symposium on Circuits and Systems
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 50th IEEE International Symposium on Circuits and Systems, ISCAS 2017
Y2 - 28 May 2017 through 31 May 2017
ER -