Although recent resistive random access memory (ReRAM)-based accelerator designs for deep convolutional neural networks (CNNs) are more energy efficient than CMOS-based accelerators, they still incur a large number of energy-consuming data transactions. In this paper, we propose MAX², a multi-tile ReRAM accelerator framework that supports multiple CNN topologies, maximizes on-chip data reuse, and reduces on-chip bandwidth to minimize the energy consumed by data movement. Building on the fact that a large filter can be constructed from a stack of smaller (3 × 3) filters, we design every tile with nine processing elements (PEs). Each PE consists of multiple ReRAM subarrays that compute the dot product. The PEs operate in a systolic fashion, thereby maximizing input feature map reuse and minimizing interconnection cost. MAX² chooses the data size granularity in the systolic array in conjunction with weight duplication to achieve very high area utilization without requiring additional peripheral circuits. We provide a detailed energy and area breakdown of each component at the PE level, tile level, and system level. The system-level evaluation at the 32-nm node on several VGG-network benchmarks shows that MAX² can improve computation efficiency (TOPs/s/mm²) by 2.5× and energy efficiency (TOPs/s/W) by 5.2× compared with a state-of-the-art ReRAM-based accelerator.
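The two ideas the tile design rests on, stacking 3 × 3 filters to cover a larger window and splitting one 3 × 3 filter across nine PEs, can be illustrated with a minimal Python sketch. It is not the accelerator's actual dataflow: the function names are hypothetical, the input is a single channel, stride is 1, there is no padding, and each "PE" is reduced to a single weight, whereas in the paper each PE computes a dot product over ReRAM subarrays.

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 kernel x kernel convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv3x3_direct(ifm, w):
    """Valid (no padding), stride-1 3x3 convolution over a 2-D list of numbers."""
    rows, cols = len(ifm) - 2, len(ifm[0]) - 2
    return [[sum(ifm[r + i][c + j] * w[i][j] for i in range(3) for j in range(3))
             for c in range(cols)]
            for r in range(rows)]


def conv3x3_nine_pe(ifm, w):
    """Same convolution, accumulated as nine per-tap partial sums.

    Assumption for illustration: hypothetical PE (i, j) holds weight w[i][j]
    and adds its contribution to every output pixel.
    """
    rows, cols = len(ifm) - 2, len(ifm[0]) - 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(3):
        for j in range(3):  # one iteration per hypothetical PE
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += ifm[r + i][c + j] * w[i][j]
    return out
```

Under these assumptions, `receptive_field(2)` gives 5 and `receptive_field(3)` gives 7, so two or three stacked 3 × 3 layers see the same window as a single 5 × 5 or 7 × 7 filter. The nine-way decomposition returns the same output as the direct convolution, and every input pixel is consumed by all nine taps, which is the reuse a systolic arrangement is meant to exploit.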
| Original language | English (US) |
| Number of pages | 13 |
| Journal | IEEE Journal on Emerging and Selected Topics in Circuits and Systems |
| State | Published - Jun 2019 |
- data reuse
ASJC Scopus subject areas
- Electrical and Electronic Engineering