TY - JOUR
T1 - A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-Based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing
AU - Li, Yixing
AU - Liu, Zichuan
AU - Liu, Wenye
AU - Jiang, Yu
AU - Wang, Yongliang
AU - Goh, Wang Ling
AU - Yu, Hao
AU - Ren, Fengbo
N1 - Funding Information:
Manuscript received April 30, 2018; revised August 25, 2018; accepted September 21, 2018. Date of publication October 29, 2018; date of current version April 30, 2019. Arizona State University’s work was supported by the National Science Foundation under Grant IIS/CPS-1652038. Nanyang Technological University’s work was supported by MOE AcRF Tier 2 under Grant MOE2015-T2-2-013. (Corresponding author: Yixing Li.) Y. Li and F. Ren are with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: yixingli@asu.edu; renfengbo@asu.edu).
Publisher Copyright:
© 1982-2012 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work is based on high-end graphics processing unit (GPU) implementations, which are commonly used on the server side. However, in Internet of Things (IoT) application scenarios, the communication overhead from the edge device to the server is quite large and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present the architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize text regions at the pixel level efficiently and in real time on mobile devices. To achieve real-time, low-latency processing, a binary convolutional encoder-decoder network is adopted as the core architecture, whose binary features enable massive parallelism. Massively parallelized computation and highly pipelined data-flow control enhance latency and throughput performance. In addition, all binarized intermediate results and parameters are stored on chip to eliminate the power consumption and latency overhead of off-chip communication. The NSTI accelerator is implemented in a 40-nm CMOS technology and can process scene text images (128 × 32 pixels) at 34 fps with a latency of 40 ms for pixelwise interpretation, achieving pixelwise classification accuracy of over 90% on the ICDAR-03 and ICDAR-13 datasets. The actual energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches 7825 GOP/s/W. The proposed accelerator is seven times more energy efficient than an optimized GPU-based implementation, while maintaining real-time throughput with a latency of 40 ms.
AB - Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work is based on high-end graphics processing unit (GPU) implementations, which are commonly used on the server side. However, in Internet of Things (IoT) application scenarios, the communication overhead from the edge device to the server is quite large and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present the architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize text regions at the pixel level efficiently and in real time on mobile devices. To achieve real-time, low-latency processing, a binary convolutional encoder-decoder network is adopted as the core architecture, whose binary features enable massive parallelism. Massively parallelized computation and highly pipelined data-flow control enhance latency and throughput performance. In addition, all binarized intermediate results and parameters are stored on chip to eliminate the power consumption and latency overhead of off-chip communication. The NSTI accelerator is implemented in a 40-nm CMOS technology and can process scene text images (128 × 32 pixels) at 34 fps with a latency of 40 ms for pixelwise interpretation, achieving pixelwise classification accuracy of over 90% on the ICDAR-03 and ICDAR-13 datasets. The actual energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches 7825 GOP/s/W. The proposed accelerator is seven times more energy efficient than an optimized GPU-based implementation, while maintaining real-time throughput with a latency of 40 ms.
KW - Application specific integrated circuits
KW - mobile applications
KW - neural network hardware
KW - real-time systems
UR - http://www.scopus.com/inward/record.url?scp=85055678252&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055678252&partnerID=8YFLogxK
U2 - 10.1109/TIE.2018.2875643
DO - 10.1109/TIE.2018.2875643
M3 - Article
AN - SCOPUS:85055678252
SN - 0278-0046
VL - 66
SP - 7407
EP - 7416
JO - IEEE Transactions on Industrial Electronics
JF - IEEE Transactions on Industrial Electronics
IS - 9
M1 - 8513982
ER -