Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

Yufei Ma; Naveen Suda; Yu Cao; Jae-sun Seo; Sarma Vrudhula

doi:10.1109/FPL.2016.7577356

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

136 Scopus citations

Abstract

Despite their popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
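
The throughput figures in the abstract can be related with a little arithmetic. The sketch below is illustrative only: the function names `gops` and `implied_baseline` are made up for this example, and the abstract does not say which model the 1.9× figure refers to, so applying it to both reported numbers is an assumption. The derived baselines are rough estimates, not values reported in the paper.

```python
def gops(total_ops: float, seconds: float) -> float:
    """Throughput in giga-operations per second (GOPS)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_ops / seconds / 1e9


def implied_baseline(reported_gops: float, speedup: float) -> float:
    """Baseline throughput implied by a reported throughput and a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return reported_gops / speedup


# Figures quoted in the abstract.
REPORTED_GOPS = {"AlexNet": 114.5, "NIN": 117.3}
SPEEDUP = 1.9  # assumption: applied uniformly to both models

for model, value in REPORTED_GOPS.items():
    estimate = implied_baseline(value, SPEEDUP)
    print(f"{model}: {value} GOPS reported, roughly {estimate:.1f} GOPS implied baseline")
```

Under that assumption, the OpenCL-based design would sit at roughly 60 GOPS for either model.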

Original language	English (US)
Title of host publication	FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9782839918442
DOIs	https://doi.org/10.1109/FPL.2016.7577356
State	Published - Sep 26 2016
Event	26th International Conference on Field-Programmable Logic and Applications, FPL 2016 - Lausanne, Switzerland
Duration	Aug 29 2016 → Sep 2 2016

Keywords

Convolutional neural networks
FPGA
hardware acceleration

ASJC Scopus subject areas

Computer Networks and Communications
Computer Science Applications
Control and Optimization


Cite this

Ma, Y., Suda, N., Cao, Y., Seo, J., & Vrudhula, S. (2016). Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications, Article 7577356 (FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/FPL.2016.7577356

@inproceedings{0cb606c39c6d4e958f570264fea9ce1a,
title = "Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA",
abstract = "Despite their popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.",
keywords = "Convolutional neural networks, FPGA, hardware acceleration",
author = "Yufei Ma and Naveen Suda and Yu Cao and Jae-sun Seo and Sarma Vrudhula",
note = "Publisher Copyright: {\textcopyright} 2016 EPFL.; 26th International Conference on Field-Programmable Logic and Applications, FPL 2016 ; Conference date: 29-08-2016 Through 02-09-2016",
year = "2016",
month = sep,
day = "26",
doi = "10.1109/FPL.2016.7577356",
language = "English (US)",
series = "FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications",
}
