FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory

Shreyas K. Venkataramanaiah; Han Sok Suh; Shihui Yin; Eriko Nurvitadhi; Aravind Dasu; Yu Cao; Jae Sun Seo

doi:10.1145/3400302.3415643

Electrical, Computer, and Energy Engineering, School of (IAFSE-ECEE)

Research output: Contribution to journal › Conference article › peer-review

19 Scopus citations

Abstract

Training convolutional neural networks (CNNs) requires intensive computation as well as a large amount of storage and memory access. While low-bandwidth off-chip memories in prior FPGA works have hindered system-level performance, modern FPGAs offer high bandwidth memory (HBM2) that unlocks opportunities to improve the throughput/energy of FPGA-based CNN training. This paper presents an FPGA accelerator for CNN training that (1) uses HBM2 for efficient off-chip communication and (2) supports various training operations (e.g., residual connections, stride-2 convolutions) for modern CNNs. We analyze the impact of HBM2 on CNN training workloads, provide a comprehensive comparison with DDR3, and present strategies to efficiently use HBM2 features for enhanced CNN training performance. For training ResNet-20/VGG-like CNNs on the CIFAR-10 dataset with a low batch size of 2, the proposed CNN training accelerator on an Intel Stratix-10 MX FPGA demonstrates 1.4/1.7X energy-efficiency improvement compared to a Stratix-10 GX FPGA with DDR3 memory, and 4.5/9.7X energy-efficiency improvement compared to a Tesla V100 GPU.

Original language	English (US)
Article number	9256704
Journal	IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
Volume	2020-November
DOIs	https://doi.org/10.1145/3400302.3415643
State	Published - Nov 2 2020
Event	39th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2020 - Virtual, San Diego, United States
Duration	Nov 2 2020 → Nov 5 2020

Keywords

Convolutional neural networks
FPGA
backpropagation
hardware accelerator
neural network training

ASJC Scopus subject areas

Software
Computer Science Applications
Computer Graphics and Computer-Aided Design

Cite this

FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory. / Venkataramanaiah, Shreyas K.; Suh, Han Sok; Yin, Shihui et al.
In: IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, Vol. 2020-November, 9256704, 02.11.2020.

@article{a59dcb3516244d6aa7aa50ebaeefd388,

title = "FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory",

abstract = "Training convolutional neural networks (CNNs) requires intensive computation as well as a large amount of storage and memory access. While low-bandwidth off-chip memories in prior FPGA works have hindered system-level performance, modern FPGAs offer high bandwidth memory (HBM2) that unlocks opportunities to improve the throughput/energy of FPGA-based CNN training. This paper presents an FPGA accelerator for CNN training that (1) uses HBM2 for efficient off-chip communication and (2) supports various training operations (e.g., residual connections, stride-2 convolutions) for modern CNNs. We analyze the impact of HBM2 on CNN training workloads, provide a comprehensive comparison with DDR3, and present strategies to efficiently use HBM2 features for enhanced CNN training performance. For training ResNet-20/VGG-like CNNs on the CIFAR-10 dataset with a low batch size of 2, the proposed CNN training accelerator on an Intel Stratix-10 MX FPGA demonstrates 1.4/1.7X energy-efficiency improvement compared to a Stratix-10 GX FPGA with DDR3 memory, and 4.5/9.7X energy-efficiency improvement compared to a Tesla V100 GPU.",

keywords = "Convolutional neural networks, FPGA, backpropagation, hardware accelerator, neural network training",

author = "Venkataramanaiah, {Shreyas K.} and Suh, {Han Sok} and Shihui Yin and Eriko Nurvitadhi and Aravind Dasu and Yu Cao and Seo, {Jae Sun}",

note = "Publisher Copyright: {\textcopyright} 2020 Association for Computing Machinery.; 39th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2020 ; Conference date: 02-11-2020 Through 05-11-2020",

year = "2020",

month = nov,

day = "2",

doi = "10.1145/3400302.3415643",

language = "English (US)",

volume = "2020-November",

journal = "IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD",

issn = "1092-3152",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory

AU - Venkataramanaiah, Shreyas K.

AU - Suh, Han Sok

AU - Yin, Shihui

AU - Nurvitadhi, Eriko

AU - Dasu, Aravind

AU - Cao, Yu

AU - Seo, Jae Sun

PY - 2020/11/2

Y1 - 2020/11/2

N2 - Training convolutional neural networks (CNNs) requires intensive computation as well as a large amount of storage and memory access. While low-bandwidth off-chip memories in prior FPGA works have hindered system-level performance, modern FPGAs offer high bandwidth memory (HBM2) that unlocks opportunities to improve the throughput/energy of FPGA-based CNN training. This paper presents an FPGA accelerator for CNN training that (1) uses HBM2 for efficient off-chip communication and (2) supports various training operations (e.g., residual connections, stride-2 convolutions) for modern CNNs. We analyze the impact of HBM2 on CNN training workloads, provide a comprehensive comparison with DDR3, and present strategies to efficiently use HBM2 features for enhanced CNN training performance. For training ResNet-20/VGG-like CNNs on the CIFAR-10 dataset with a low batch size of 2, the proposed CNN training accelerator on an Intel Stratix-10 MX FPGA demonstrates 1.4/1.7X energy-efficiency improvement compared to a Stratix-10 GX FPGA with DDR3 memory, and 4.5/9.7X energy-efficiency improvement compared to a Tesla V100 GPU.

AB - Training convolutional neural networks (CNNs) requires intensive computation as well as a large amount of storage and memory access. While low-bandwidth off-chip memories in prior FPGA works have hindered system-level performance, modern FPGAs offer high bandwidth memory (HBM2) that unlocks opportunities to improve the throughput/energy of FPGA-based CNN training. This paper presents an FPGA accelerator for CNN training that (1) uses HBM2 for efficient off-chip communication and (2) supports various training operations (e.g., residual connections, stride-2 convolutions) for modern CNNs. We analyze the impact of HBM2 on CNN training workloads, provide a comprehensive comparison with DDR3, and present strategies to efficiently use HBM2 features for enhanced CNN training performance. For training ResNet-20/VGG-like CNNs on the CIFAR-10 dataset with a low batch size of 2, the proposed CNN training accelerator on an Intel Stratix-10 MX FPGA demonstrates 1.4/1.7X energy-efficiency improvement compared to a Stratix-10 GX FPGA with DDR3 memory, and 4.5/9.7X energy-efficiency improvement compared to a Tesla V100 GPU.

KW - Convolutional neural networks

KW - FPGA

KW - backpropagation

KW - hardware accelerator

KW - neural network training

UR - http://www.scopus.com/inward/record.url?scp=85097961373&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85097961373&partnerID=8YFLogxK

U2 - 10.1145/3400302.3415643

DO - 10.1145/3400302.3415643

M3 - Conference article

AN - SCOPUS:85097961373

SN - 1092-3152

VL - 2020-November

JO - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD

JF - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD

M1 - 9256704

T2 - 39th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2020

Y2 - 2 November 2020 through 5 November 2020

ER -
