A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing

Yixing Li, Zichuan Liu, Wenye Liu, Yu Jiang, Yongliang Wang, Wang Ling Goh, Hao Yu, Fengbo Ren

Research output: Contribution to journal › Article

Abstract

Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work relies on high-end GPU implementations, which are commonly deployed on the server side. In IoT application scenarios, however, the communication overhead from the edge device to the server is substantial and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present the architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize text regions at the pixel level efficiently and in real time on mobile devices. To achieve real-time, low-latency processing, a Binary Convolutional Encoder-decoder Network (B-CEDNet) is adopted as the core architecture, whose binary weights and activations enable massive parallelism. Massively parallelized computation and highly pipelined data flow control improve its latency and throughput. The NSTI accelerator is implemented in a 40-nm CMOS technology and can process 128×32 scene text images at 34 fps with a latency of 40 ms for pixelwise interpretation, achieving pixelwise classification accuracy of over 90% on the ICDAR-03 and ICDAR-13 datasets. The actual energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches 7825 GOP/s/W. The proposed accelerator is 7× more energy efficient than its optimized GPU-based counterpart while maintaining real-time throughput with a 40-ms latency.
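The binary weights and activations of B-CEDNet are what allow each convolution's multiply-accumulates to collapse into XNOR and popcount operations, which is the usual source of a binarized-network accelerator's parallelism and energy-efficiency advantage. A minimal Python sketch of that idea (the bit-packing scheme and function names here are illustrative, not the accelerator's actual datapath):

```python
import numpy as np

def binarize(x):
    """Map real values to {+1, -1} by sign (zero maps to +1)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def binary_dot(a_bits, b_bits):
    """Dot product of two {+1, -1} vectors via XNOR + popcount.

    With each vector's sign bits packed into bytes, XNOR counts the
    positions where the signs agree; the dot product is then
    2 * agreements - length.
    """
    n = len(a_bits)
    a = np.packbits(a_bits > 0)
    b = np.packbits(b_bits > 0)
    # XNOR = NOT(XOR); count matching bits within each byte
    agree = sum(bin(~(x ^ y) & 0xFF).count("1")
                for x, y in zip(a.tolist(), b.tolist()))
    agree -= (-n) % 8  # zero-padded tail bits always "agree"; remove them
    return 2 * agree - n
```

For example, `binary_dot(binarize(w), binarize(x))` matches `np.dot` on the corresponding ±1 vectors while touching only bitwise operations; a hardware datapath evaluates many such packed-word XNOR/popcount lanes per cycle in parallel.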

Original language: English (US)
Journal: IEEE Transactions on Industrial Electronics
DOI: 10.1109/TIE.2018.2875643
State: Accepted/In press - Jan 1 2018

Keywords

  • Application specific integrated circuits
  • Mobile applications
  • Neural network hardware
  • Real-time systems

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering

Cite this

A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing. / Li, Yixing; Liu, Zichuan; Liu, Wenye; Jiang, Yu; Wang, Yongliang; Goh, Wang Ling; Yu, Hao; Ren, Fengbo.

In: IEEE Transactions on Industrial Electronics, 01.01.2018.

Research output: Contribution to journal › Article

@article{8f0b92193c5a4e488a9320fe52e7e2e9,
title = "A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing",
abstract = "Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work relies on high-end GPU implementations, which are commonly deployed on the server side. In IoT application scenarios, however, the communication overhead from the edge device to the server is substantial and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present the architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize text regions at the pixel level efficiently and in real time on mobile devices. To achieve real-time, low-latency processing, a Binary Convolutional Encoder-decoder Network (B-CEDNet) is adopted as the core architecture, whose binary weights and activations enable massive parallelism. Massively parallelized computation and highly pipelined data flow control improve its latency and throughput. The NSTI accelerator is implemented in a 40-nm CMOS technology and can process 128×32 scene text images at 34 fps with a latency of 40 ms for pixelwise interpretation, achieving pixelwise classification accuracy of over 90{\%} on the ICDAR-03 and ICDAR-13 datasets. The actual energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches 7825 GOP/s/W. The proposed accelerator is 7× more energy efficient than its optimized GPU-based counterpart while maintaining real-time throughput with a 40-ms latency.",
keywords = "Application specific integrated circuits, Mobile applications, Neural network hardware, Real-time systems",
author = "Yixing Li and Zichuan Liu and Wenye Liu and Yu Jiang and Yongliang Wang and Goh, {Wang Ling} and Hao Yu and Fengbo Ren",
year = "2018",
month = "1",
day = "1",
doi = "10.1109/TIE.2018.2875643",
language = "English (US)",
journal = "IEEE Transactions on Industrial Electronics",
issn = "0278-0046",
publisher = "IEEE Industrial Electronics Society",

}

TY - JOUR

T1 - A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing

AU - Li, Yixing

AU - Liu, Zichuan

AU - Liu, Wenye

AU - Jiang, Yu

AU - Wang, Yongliang

AU - Goh, Wang Ling

AU - Yu, Hao

AU - Ren, Fengbo

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work relies on high-end GPU implementations, which are commonly deployed on the server side. In IoT application scenarios, however, the communication overhead from the edge device to the server is substantial and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present the architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize text regions at the pixel level efficiently and in real time on mobile devices. To achieve real-time, low-latency processing, a Binary Convolutional Encoder-decoder Network (B-CEDNet) is adopted as the core architecture, whose binary weights and activations enable massive parallelism. Massively parallelized computation and highly pipelined data flow control improve its latency and throughput. The NSTI accelerator is implemented in a 40-nm CMOS technology and can process 128×32 scene text images at 34 fps with a latency of 40 ms for pixelwise interpretation, achieving pixelwise classification accuracy of over 90% on the ICDAR-03 and ICDAR-13 datasets. The actual energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches 7825 GOP/s/W. The proposed accelerator is 7× more energy efficient than its optimized GPU-based counterpart while maintaining real-time throughput with a 40-ms latency.

AB - Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work relies on high-end GPU implementations, which are commonly deployed on the server side. In IoT application scenarios, however, the communication overhead from the edge device to the server is substantial and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present the architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize text regions at the pixel level efficiently and in real time on mobile devices. To achieve real-time, low-latency processing, a Binary Convolutional Encoder-decoder Network (B-CEDNet) is adopted as the core architecture, whose binary weights and activations enable massive parallelism. Massively parallelized computation and highly pipelined data flow control improve its latency and throughput. The NSTI accelerator is implemented in a 40-nm CMOS technology and can process 128×32 scene text images at 34 fps with a latency of 40 ms for pixelwise interpretation, achieving pixelwise classification accuracy of over 90% on the ICDAR-03 and ICDAR-13 datasets. The actual energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches 7825 GOP/s/W. The proposed accelerator is 7× more energy efficient than its optimized GPU-based counterpart while maintaining real-time throughput with a 40-ms latency.

KW - Application specific integrated circuits

KW - Mobile applications

KW - Neural network hardware

KW - Real-time systems

UR - http://www.scopus.com/inward/record.url?scp=85055678252&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055678252&partnerID=8YFLogxK

U2 - 10.1109/TIE.2018.2875643

DO - 10.1109/TIE.2018.2875643

M3 - Article

JO - IEEE Transactions on Industrial Electronics

JF - IEEE Transactions on Industrial Electronics

SN - 0278-0046

ER -