Inference engine benchmarking across technological platforms from CMOS to RRAM

Xiaochen Peng, Minkyu Kim, Xiaoyu Sun, Shihui Yin, Titash Rakshit, Ryan M. Hatcher, Jorge A. Kittl, Jae-sun Seo, Shimeng Yu

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

State-of-the-art deep convolutional neural networks (CNNs) are widely used in current AI systems and achieve remarkable success in image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engines based on various approaches, including the systolic architecture, near-memory processing, and the processing-in-memory (PIM) approach with emerging technologies such as resistive random access memory (RRAM). However, a comprehensive comparison of these approaches in a unified framework is missing, and the benefits of new designs or emerging technologies are mostly based on qualitative projections. In this paper, we evaluate the energy efficiency and frame rate of a VGG-like CNN inference accelerator on the CIFAR-10 dataset across technological platforms from CMOS to post-CMOS, under a hardware resource constraint, i.e., comparable on-chip area. We also investigate the effects of off-chip DRAM access and interconnect during data movement, which are the bottlenecks of CMOS platforms. Our quantitative analysis shows that the peripheries (ADCs), rather than the memory array, dominate energy consumption and area in the digital RRAM-based parallel-readout PIM architecture. Despite the presence of ADCs, this architecture shows a >2.5× improvement in energy efficiency (TOPS/W) over systolic arrays or near-memory processing, with a comparable frame rate due to reduced DRAM access, high throughput, and optimized parallel readout. A further >10× improvement can be achieved by implementing a bit-count-reduced XNOR network and pipelining.
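The record does not include the paper's benchmarking framework, but the abstract's headline metrics follow directly from per-layer operation counts and per-frame energy/latency budgets. The Python sketch below shows how energy efficiency (TOPS/W) and frame rate are typically derived; every layer shape, energy value, and latency value in it is a hypothetical placeholder for illustration, not a number from the paper.

# Illustrative sketch only: deriving TOPS/W and frame rate for a VGG-like
# CNN on CIFAR-10. All energy and latency figures below are hypothetical
# placeholders, not values reported in the paper.

def conv_macs(h, w, cin, cout, k=3):
    """MACs for a k x k convolution with 'same' padding on an h x w feature map."""
    return h * w * cin * cout * k * k

# Assumed VGG-like layer shapes for 32x32 CIFAR-10 inputs (not the paper's exact model).
layers = [
    conv_macs(32, 32, 3, 64),
    conv_macs(32, 32, 64, 64),
    conv_macs(16, 16, 64, 128),
    conv_macs(16, 16, 128, 128),
    conv_macs(8, 8, 128, 256),
    conv_macs(8, 8, 256, 256),
]
total_macs = sum(layers)
total_ops = 2 * total_macs  # 1 MAC = 2 ops (multiply + add), the usual TOPS convention

# Hypothetical per-frame energy budget (joules), split the way the paper's
# analysis does: memory array vs. peripheral ADCs vs. off-chip DRAM traffic.
energy_array_j = 0.4e-6
energy_adc_j = 1.6e-6    # ADCs dominating, as the abstract reports for RRAM PIM
energy_dram_j = 0.3e-6
energy_per_frame_j = energy_array_j + energy_adc_j + energy_dram_j

latency_per_frame_s = 0.5e-3  # hypothetical end-to-end latency per image

tops_per_w = (total_ops / energy_per_frame_j) / 1e12  # ops/J == (ops/s)/W
frame_rate = 1.0 / latency_per_frame_s

print(f"Total ops/frame: {total_ops:.3e}")
print(f"Energy efficiency: {tops_per_w:.1f} TOPS/W (placeholder energies)")
print(f"Frame rate: {frame_rate:.0f} frames/s (placeholder latency)")

With this kind of breakdown, the abstract's claims read directly off the terms: the ADC share of energy_per_frame_j dominating is what makes ADCs the efficiency bottleneck, and reducing energy_dram_j is what keeps the PIM design's frame rate competitive.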

Original language: English (US)
Title of host publication: MEMSYS 2019 - Proceedings of the International Symposium on Memory Systems
Publisher: Association for Computing Machinery
Pages: 471-479
Number of pages: 9
ISBN (Electronic): 9781450372060
DOI: https://doi.org/10.1145/3357526.3357566
State: Published - Sep 30 2019
Event: 2019 International Symposium on Memory Systems, MEMSYS 2019 - Washington, United States
Duration: Sep 30 2019 - Oct 3 2019

Publication series

NameACM International Conference Proceeding Series

Conference

Conference: 2019 International Symposium on Memory Systems, MEMSYS 2019
Country: United States
City: Washington
Period: 9/30/19 - 10/3/19


Keywords

  • Deep convolutional neural network
  • Hardware accelerator
  • Near memory processing
  • Processing in memory
  • Resistive random access memory
  • Systolic architecture

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Peng, X., Kim, M., Sun, X., Yin, S., Rakshit, T., Hatcher, R. M., ... Yu, S. (2019). Inference engine benchmarking across technological platforms from CMOS to RRAM. In MEMSYS 2019 - Proceedings of the International Symposium on Memory Systems (pp. 471-479). (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3357526.3357566
