TY - GEN
T1 - A 28nm 8-bit Floating-Point Tensor Core based CNN Training Processor with Dynamic Activation/Weight Sparsification
AU - Venkataramanaiah, Shreyas Kolala
AU - Meng, Jian
AU - Suh, Han Sok
AU - Yeo, Injune
AU - Saikia, Jyotishman
AU - Cherupally, Sai Kiran
AU - Zhang, Yichi
AU - Zhang, Zhiru
AU - Seo, Jae Sun
N1 - Funding Information:
This work is supported in part by NSF and JUMP C-BRIC, an SRC program sponsored by DARPA.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - We present an 8-bit floating-point (FP8) training processor that implements (1) highly parallel tensor cores (fused multiply-add trees) that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. We develop a custom ISA to flexibly support different CNN topologies and training parameters. The 28nm prototype chip demonstrates large improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×) for both supervised and self-supervised training tasks.
KW - Convolutional neural networks
KW - deep neural network training
KW - hardware accelerator
KW - structured sparsity
UR - http://www.scopus.com/inward/record.url?scp=85141481129&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141481129&partnerID=8YFLogxK
U2 - 10.1109/ESSCIRC55480.2022.9911359
DO - 10.1109/ESSCIRC55480.2022.9911359
M3 - Conference contribution
AN - SCOPUS:85141481129
T3 - ESSCIRC 2022 - IEEE 48th European Solid State Circuits Conference, Proceedings
SP - 89
EP - 92
BT - ESSCIRC 2022 - IEEE 48th European Solid State Circuits Conference, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE European Solid State Circuits Conference, ESSCIRC 2022
Y2 - 19 September 2022 through 22 September 2022
ER -