FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory

Shreyas K. Venkataramanaiah, Han Sok Suh, Shihui Yin, Eriko Nurvitadhi, Aravind Dasu, Yu Cao, Jae Sun Seo

Research output: Contribution to journalConference articlepeer-review

Abstract

Training convolutional neural networks (CNNs) requires intensive computations as well as a large amount of storage and memory access. While low bandwidth off-chip memories in prior FPGA works have hindered the system-level performance, modern FPGAs offer high bandwidth memory (HBM2) that unlocks opportunities to improve the throughput/energy of FPGA-based CNN training. This paper presents a FPGA accelerator for CNN training which (1) uses HBM2 for efficient off-chip communication, and (2) supports various training operations (e.g. residual connections, stride-2 convolutions) for modern CNNs. We analyze the impact of HBM2 on CNN training workloads, provide a comprehensive comparison with DDR3, and present the strategies to efficiently use HBM2 features for enhanced CNN training performance. For training ResNet-20/VGG-like CNNs for CIFAR-10 dataset with low batch size of 2, the proposed CNN training accelerator on Intel Stratix-10 MX FPGA demonstrates 1.4/1.7X energy-efficiency improvement compared to Stratix-10 GX FPGA with DDR3 memory, and 4.5/9.7 X energy-efficiency improvement compared to Tesla V100 GPU.

Original languageEnglish (US)
Article number9256704
JournalIEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
Volume2020-November
DOIs
StatePublished - Nov 2 2020
Event39th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2020 - Virtual, San Diego, United States
Duration: Nov 2 2020Nov 5 2020

Keywords

  • backpropagation
  • Convolutional neural networks
  • FPGA
  • hardware accelerator
  • neural network training

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Graphics and Computer-Aided Design

Fingerprint Dive into the research topics of 'FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory'. Together they form a unique fingerprint.

Cite this