TY - GEN
T1 - Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks
AU - Suda, Naveen
AU - Chandra, Vikas
AU - Dasika, Ganesh
AU - Mohanty, Abinash
AU - Ma, Yufei
AU - Vrudhula, Sarma
AU - Seo, Jae-sun
AU - Cao, Yu
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/2/21
Y1 - 2016/2/21
N2 - Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute- and memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turnaround time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operations, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on the P395-D8 board.
AB - Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute- and memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turnaround time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operations, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on the P395-D8 board.
KW - Convolutional neural networks
KW - FPGA
KW - OpenCL
KW - Optimization
UR - http://www.scopus.com/inward/record.url?scp=84966471227&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84966471227&partnerID=8YFLogxK
U2 - 10.1145/2847263.2847276
DO - 10.1145/2847263.2847276
M3 - Conference contribution
AN - SCOPUS:84966471227
T3 - FPGA 2016 - Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
SP - 16
EP - 25
BT - FPGA 2016 - Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
PB - Association for Computing Machinery, Inc
T2 - 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2016
Y2 - 21 February 2016 through 23 February 2016
ER -