Abstract

Despite their popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for the AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

Original language: English (US)
Title of host publication: FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9782839918442
DOIs: https://doi.org/10.1109/FPL.2016.7577356
State: Published - Sep 26 2016
Event: 26th International Conference on Field-Programmable Logic and Applications, FPL 2016 - Lausanne, Switzerland
Duration: Aug 29 2016 to Sep 2 2016

Publication series

Name: FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications

Other

Other: 26th International Conference on Field-Programmable Logic and Applications, FPL 2016
Country: Switzerland
City: Lausanne
Period: 8/29/16 to 9/2/16

Keywords

  • Convolutional neural networks
  • FPGA
  • hardware acceleration

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Control and Optimization


  • Cite this

    Ma, Y., Suda, N., Cao, Y., Seo, J., & Vrudhula, S. (2016). Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications [7577356] (FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/FPL.2016.7577356