@article{0cfdaac69f6840a2b3b9308d5748287d,
  title     = {A {GPU}-outperforming {FPGA} accelerator architecture for binary convolutional neural networks},
  abstract  = {FPGA-based hardware accelerators for convolutional neural networks (CNNs) have received attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this article, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized fully mapped FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipelines stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies largely depending on the batch size of the data. Experiment results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3$\times$ faster and 75$\times$ more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5$\times$ higher energy efficiency.},
  keywords  = {Binary neural network, Convolutional neural network, Deep learning, Energy efficiency, FPGA, Hardware acceleration, High-throughput},
  author    = {Li, Yixing and Liu, Zichuan and Xu, Kai and Yu, Hao and Ren, Fengbo},
  note      = {Funding Information: This work, by Arizona State University and Nanyang Technological University, was supported by Cisco Research Center (CG\#594589) and Singapore MOE Tier-2 (MOE2015-T2-2-013), respectively. Authors{\textquoteright} addresses: Y. Li, K. Xu, and F. Ren, 699 S. Mill Avenue, \# 553, Tempe, AZ 85281, USA; emails: {yixingli, kaixu, renfengbo}@asu.edu; Z. Liu, 50 Nanyang Ave, Singapore, 639798; email: zliu016@e.ntu.edu.sg; H. Yu, EE Department 1088 Xueyuan Rd., Shenzhen, 518055, China; email: yuh3@sustc.edu.cn. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. 2018 Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 1550-4832/2018/07-ART18 \$15.00 https://doi.org/10.1145/3154839 Funding Information: This work, by Arizona State University and Nanyang Technological University, was supported by Cisco Research Center (CG\#594589) and Singapore MOE Tier-2 (MOE2015-T2-2-013), respectively. We acknowledge Mr. Skip Booth and Mr. Hugo Latapie from Cisco for fruitful research discussions. We also thank the Xilinx University Program for donating the FPGA boards. Publisher Copyright: {\textcopyright} 2018 ACM.},
  year      = {2018},
  month     = jul,
  doi       = {10.1145/3154839},
  language  = {English (US)},
  volume    = {14},
  number    = {2},
  journal   = {ACM Journal on Emerging Technologies in Computing Systems},
  issn      = {1550-4832},
  publisher = {Association for Computing Machinery (ACM)},
}