TY - JOUR
T1 - ALAMO
T2 - FPGA acceleration of deep learning algorithms with a modularized RTL compiler
AU - Ma, Yufei
AU - Suda, Naveen
AU - Cao, Yu
AU - Vrudhula, Sarma
AU - Seo, Jae-sun
N1 - Funding Information:
In this paper, ALAMO RTL compiler is proposed to accelerate CNNs on FPGA platforms, where the computing primitives could be easily compiled from the parametrized hardware library. Representative CNN algorithms of AlexNet and NiN have been demonstrated on an Altera Stratix-V FPGA board, which show an end-to-end throughput of 114.5 GOPS and 117.3 GOPS, resulting in 1.9X improvement compared to an optimized OpenCL design on the same FPGA board. Future work includes adopting techniques in Ref. [ 31 , 32 ] to increase the compiler's generality and efficiency of data and weight transfer for larger state-of-the-art CNN models [ 6 , 25 , 26 ]. A cknowledgment This work was in part supported by the National Science Foundation within the Directorate for Engineering under Grants 1230401 and 1237856 , the NSF I/UCRC Center for Embedded Systems through NSF grants 1361926 and 1432348 , NSF grant 1652866 , and Samsung Advanced Institute of Technology.
Publisher Copyright:
© 2017 Elsevier B.V.
PY - 2018/6
Y1 - 2018/6
N2 - Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation, and frequent memory accesses. Although existing high-level synthesis tools (e.g., HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient with respect to resource allocation for maximizing parallelism and throughput. Manual hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration, but it requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. This work presents a scalable solution that achieves the flexibility and reduced design time of high-level synthesis and the near-optimality of an RTL implementation. The proposed solution is a compiler that analyzes the algorithm structure and parameters, and automatically integrates a set of modular and scalable computing primitives to accelerate the operation of various deep learning algorithms on an FPGA. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed RTL compiler, named ALAMO, is demonstrated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9X improvement in throughput compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
AB - Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation, and frequent memory accesses. Although existing high-level synthesis tools (e.g., HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient with respect to resource allocation for maximizing parallelism and throughput. Manual hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration, but it requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. This work presents a scalable solution that achieves the flexibility and reduced design time of high-level synthesis and the near-optimality of an RTL implementation. The proposed solution is a compiler that analyzes the algorithm structure and parameters, and automatically integrates a set of modular and scalable computing primitives to accelerate the operation of various deep learning algorithms on an FPGA. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed RTL compiler, named ALAMO, is demonstrated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9X improvement in throughput compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
KW - Compiler
KW - Convolutional neural networks
KW - FPGA
KW - Hardware acceleration
KW - RTL
UR - http://www.scopus.com/inward/record.url?scp=85039998468&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039998468&partnerID=8YFLogxK
U2 - 10.1016/j.vlsi.2017.12.009
DO - 10.1016/j.vlsi.2017.12.009
M3 - Review article
AN - SCOPUS:85039998468
SN - 0167-9260
VL - 62
SP - 14
EP - 23
JO - Integration
JF - Integration
ER -