TY - JOUR
T1 - ALAMO
T2 - FPGA acceleration of deep learning algorithms with a modularized RTL compiler
AU - Ma, Yufei
AU - Suda, Naveen
AU - Cao, Yu
AU - Vrudhula, Sarma
AU - Seo, Jae-sun
N1 - Funding Information:
In this paper, ALAMO RTL compiler is proposed to accelerate CNNs on FPGA platforms, where the computing primitives could be easily compiled from the parametrized hardware library. Representative CNN algorithms of AlexNet and NiN have been demonstrated on an Altera Stratix-V FPGA board, which show an end-to-end throughput of 114.5 GOPS and 117.3 GOPS, resulting in 1.9X improvement compared to an optimized OpenCL design on the same FPGA board. Future work includes adopting techniques in Ref. [ 31 , 32 ] to increase the compiler's generality and efficiency of data and weight transfer for larger state-of-the-art CNN models [ 6 , 25 , 26 ]. A cknowledgment This work was in part supported by the National Science Foundation within the Directorate for Engineering under Grants 1230401 and 1237856 , the NSF I/UCRC Center for Embedded Systems through NSF grants 1361926 and 1432348 , NSF grant 1652866 , and Samsung Advanced Institute of Technology.
Publisher Copyright:
© 2017 Elsevier B.V.
PY - 2018/6
Y1 - 2018/6
N2 - Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation, and frequent memory accesses. Although existing high-level synthesis tools (e.g., HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient with respect to resource allocation for maximizing parallelism and throughput. Manual hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration, but it requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. This work presents a scalable solution that achieves the flexibility and reduced design time of high-level synthesis and the near-optimality of an RTL implementation. The proposed solution is a compiler that analyzes the algorithm structure and parameters, and automatically integrates a set of modular and scalable computing primitives to accelerate the operation of various deep learning algorithms on an FPGA. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed RTL compiler, named ALAMO, is demonstrated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9X improvement in throughput compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
AB - Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation, and frequent memory accesses. Although existing high-level synthesis tools (e.g., HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient with respect to resource allocation for maximizing parallelism and throughput. Manual hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration, but it requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. This work presents a scalable solution that achieves the flexibility and reduced design time of high-level synthesis and the near-optimality of an RTL implementation. The proposed solution is a compiler that analyzes the algorithm structure and parameters, and automatically integrates a set of modular and scalable computing primitives to accelerate the operation of various deep learning algorithms on an FPGA. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed RTL compiler, named ALAMO, is demonstrated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9X improvement in throughput compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
KW - Compiler
KW - Convolutional neural networks
KW - FPGA
KW - Hardware acceleration
KW - RTL
UR - http://www.scopus.com/inward/record.url?scp=85039998468&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039998468&partnerID=8YFLogxK
U2 - 10.1016/j.vlsi.2017.12.009
DO - 10.1016/j.vlsi.2017.12.009
M3 - Review article
AN - SCOPUS:85039998468
SN - 0167-9260
VL - 62
SP - 14
EP - 23
JO - Integration
JF - Integration
ER -