43 Citations (Scopus)

Abstract

Despite their popularity, deploying Convolutional Neural Networks (CNNs) on portable systems remains challenging due to the large data volume, intensive computation, and frequent memory accesses involved. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration, but it requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis with the finer-grained optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules into end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy for optimizing the throughput of a given CNN model under the FPGA's resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for the AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
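
The design-space exploration the abstract describes (allocating a fixed pool of FPGA compute resources across loop-level parallelism to maximize end-to-end throughput under resource constraints) can be illustrated with a small sketch. The snippet below is a toy model under stated assumptions, not the authors' compiler: `ConvLayer`, `explore`, the layer shapes, the 256-DSP budget, and the 100 MHz clock are all hypothetical, and one DSP slice is assumed per parallel multiply-accumulate lane.

```python
# Toy sketch of resource-constrained loop unrolling; NOT the paper's
# compiler. All shapes, budgets, and clock rates are illustrative.
from dataclasses import dataclass
from math import ceil

@dataclass
class ConvLayer:               # hypothetical layer descriptor
    in_ch: int                 # input feature maps
    out_ch: int                # output feature maps
    out_px: int                # output pixels (height * width)
    k: int                     # square kernel size

    def macs(self) -> int:
        # One multiply-accumulate per kernel tap per output pixel.
        return self.in_ch * self.out_ch * self.out_px * self.k * self.k

    def cycles(self, u_oc: int, u_px: int) -> int:
        # Unroll u_oc output channels and u_px output pixels per cycle;
        # ceil() models lanes left idle when a factor does not divide
        # the loop bound, which is where hardware inefficiency creeps in.
        return (ceil(self.out_ch / u_oc) * ceil(self.out_px / u_px)
                * self.in_ch * self.k * self.k)

def explore(layers, dsp_budget: int, clk_hz: float):
    """Sweep unroll pairs whose product fits the DSP budget (assuming one
    DSP per MAC lane) and return the end-to-end GOPS-maximizing pair for
    a single compute module reused across all layers."""
    total_ops = 2 * sum(l.macs() for l in layers)   # 1 MAC = 2 ops
    best = (0.0, 1, 1)                              # (gops, u_oc, u_px)
    for u_oc in range(1, dsp_budget + 1):
        for u_px in range(1, dsp_budget // u_oc + 1):
            cycles = sum(l.cycles(u_oc, u_px) for l in layers)
            gops = total_ops * clk_hz / cycles / 1e9
            best = max(best, (gops, u_oc, u_px))
    return best

if __name__ == "__main__":
    # Loosely AlexNet-shaped toy layers, purely for illustration.
    net = [ConvLayer(3, 96, 55 * 55, 11),
           ConvLayer(96, 256, 27 * 27, 5),
           ConvLayer(256, 384, 13 * 13, 3)]
    gops, u_oc, u_px = explore(net, dsp_budget=256, clk_hz=100e6)
    print(f"best unroll: {u_oc} channels x {u_px} pixels -> ~{gops:.1f} GOPS")
```

Even in this toy form, the sweep shows why the factor choice matters: unroll factors that divide a layer's loop bounds keep every MAC lane busy, while poorly matched factors leave DSPs idle. That utilization-versus-parallelism trade-off, resolved per model, is the kind of quantitative design strategy the paper's compiler automates.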

Original language: English (US)
Title of host publication: FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9782839918442
DOI: 10.1109/FPL.2016.7577356
State: Published - Sep 26, 2016
Event: 26th International Conference on Field-Programmable Logic and Applications, FPL 2016 - Lausanne, Switzerland
Duration: Aug 29, 2016 - Sep 2, 2016

Other

Other: 26th International Conference on Field-Programmable Logic and Applications, FPL 2016
Country: Switzerland
City: Lausanne
Period: 8/29/16 - 9/2/16

Keywords

  • Convolutional neural networks
  • FPGA
  • hardware acceleration

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Control and Optimization

Cite this

Ma, Y., Suda, N., Cao, Y., Seo, J., & Vrudhula, S. (2016). Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications (Article 7577356). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/FPL.2016.7577356
