A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors

Win San Khwa, Jia Jing Chen, Jia Fang Li, Xin Si, En Yu Yang, Xiaoyu Sun, Rui Liu, Pai Yu Chen, Qiang Li, Shimeng Yu, Meng Fan Chang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

For deep-neural-network (DNN) processors [1-4], the product-sum (PS) operation predominates the computational workload for both convolution (CNVL) and fully-connected (FCNL) neural-network (NN) layers. This hinders the adoption of DNN processors in edge artificial-intelligence (AI) devices, which require low-power, low-cost, and fast inference. Binary DNNs [5-6] are used to reduce computation and hardware costs for AI edge devices; however, a memory bottleneck remains. As shown in Fig. 31.5.1, conventional PE arrays exploit parallelized computation but suffer from inefficient single-row SRAM access to weights and intermediate data. Computing-in-memory (CIM) improves efficiency by enabling parallel computing, reducing memory accesses, and suppressing intermediate data. Nonetheless, three critical challenges remain (Fig. 31.5.2), particularly for FCNL. We overcome these problems by co-optimizing the circuits and the system. Recent research has focused on XNOR-based binary-DNN structures [6]. Although these achieve slightly higher accuracy than other binary structures, they incur significant hardware cost (i.e., 8T-12T SRAM) to implement a CIM system. To further reduce hardware cost by using 6T SRAM for the CIM system, we employ the binary DNN with 0/1 neurons and ±1 weights proposed in [7]. We implemented a 65nm 4Kb algorithm-dependent CIM-SRAM unit-macro and an in-house binary DNN structure (focusing on FCNL with a simplified PE array) for cost-aware DNN AI edge processors. This resulted in the first binary-based CIM-SRAM macro, with the fastest (2.3ns) PS operation and the highest energy efficiency (55.8TOPS/W) among reported CIM macros [3-4].
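To make the data flow concrete, the sketch below reproduces in plain NumPy the product-sum arithmetic that such a macro evaluates for one fully-connected binary layer with 0/1 neurons and ±1 weights (the scheme of [7]). The layer dimensions, the random data, and the thresholding activation are illustrative assumptions for this sketch, not details taken from the paper; the macro itself performs this accumulation inside the 6T SRAM array rather than in software.

import numpy as np

# Minimal functional sketch (not the paper's circuit): the product-sum (PS)
# operation of one fully-connected binary-DNN layer, using the 0/1-neuron,
# +/-1-weight scheme referenced in the abstract ([7]). Sizes, data, and the
# thresholding step are illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)

n_inputs, n_outputs = 64, 16                          # hypothetical layer dimensions
x = rng.integers(0, 2, size=n_inputs)                 # 0/1 input neurons (activations)
W = rng.choice([-1, 1], size=(n_outputs, n_inputs))   # +/-1 weights

# Digital reference for what the CIM-SRAM macro evaluates in parallel:
# each output is a signed sum of the weights selected by the active (=1) inputs.
ps = W @ x                                            # product-sum per output neuron

# A binary DNN would then binarize the result, e.g. with a threshold
# activation, to produce the next layer's 0/1 neurons (illustrative choice).
y = (ps >= 0).astype(int)

print(ps[:5], y[:5])

In the macro described in the abstract, this signed accumulation is carried out fully in parallel across the SRAM rows instead of through single-row accesses, which is what yields the reported 2.3ns PS latency and 55.8TOPS/W energy efficiency.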

Original language: English (US)
Title of host publication: 2018 IEEE International Solid-State Circuits Conference, ISSCC 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 496-498
Number of pages: 3
Volume: 61
ISBN (Electronic): 9781509049394
DOIs: https://doi.org/10.1109/ISSCC.2018.8310401
State: Published - Mar 8 2018
Event: 65th IEEE International Solid-State Circuits Conference, ISSCC 2018 - San Francisco, United States
Duration: Feb 11 2018 – Feb 15 2018

Other

Other: 65th IEEE International Solid-State Circuits Conference, ISSCC 2018
Country: United States
City: San Francisco
Period: 2/11/18 – 2/15/18

Fingerprint

Static random access storage
Macros
Data storage equipment
Artificial intelligence
Costs
Hardware
Deep neural networks
Network layers
Parallel processing systems
Convolution
Computer hardware
Neurons
Energy efficiency
Computer systems
Neural networks
Networks (circuits)

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Electrical and Electronic Engineering

Cite this

Khwa, W. S., Chen, J. J., Li, J. F., Si, X., Yang, E. Y., Sun, X., ... Chang, M. F. (2018). A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors. In 2018 IEEE International Solid-State Circuits Conference, ISSCC 2018 (Vol. 61, pp. 496-498). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISSCC.2018.8310401
