TY - GEN
T1 - Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
AU - Ma, Yufei
AU - Suda, Naveen
AU - Cao, Yu
AU - Seo, Jae-sun
AU - Vrudhula, Sarma
N1 - Publisher Copyright: © 2016 EPFL.
PY - 2016/9/26
Y1 - 2016/9/26
N2 - Despite their popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
AB - Despite their popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
KW - Convolutional neural networks
KW - FPGA
KW - hardware acceleration
UR - http://www.scopus.com/inward/record.url?scp=84994891079&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84994891079&partnerID=8YFLogxK
U2 - 10.1109/FPL.2016.7577356
DO - 10.1109/FPL.2016.7577356
M3 - Conference contribution
AN - SCOPUS:84994891079
T3 - FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications
BT - FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th International Conference on Field-Programmable Logic and Applications, FPL 2016
Y2 - 29 August 2016 through 2 September 2016
ER -