Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer

Kaiqi Zhao; Hieu Duy Nguyen; Animesh Jain; Nathan Susanj; Athanasios Mouchtaris; Lokesh Gupta; Ming Zhao

doi:10.21437/Interspeech.2022-500

Research output: Contribution to journal › Conference article › peer-review

5 Scopus citations

Abstract

Automatic Speech Recognition (ASR) is increasingly used by edge applications such as intelligent virtual assistants. However, state-of-the-art ASR models such as the Recurrent Neural Network Transducer (RNN-T) are computationally intensive for resource-constrained edge devices. Knowledge Distillation (KD) is a promising approach to compressing large models by using a large model ("teacher") to train a small model ("student"). This paper proposes a novel KD method for RNN-T called Log-Curriculum based Module Replacing (LCMR). LCMR compresses RNN-T and addresses its unique characteristics by replacing teacher modules that contain multiple Long Short-Term Memory (LSTM)/Dense layers with substitute student modules that contain fewer LSTM/Dense layers. LCMR employs a novel nonlinear, Curriculum Learning driven replacement strategy that further improves performance by updating the replacing rates with a dynamic smoothing mechanism. Under LCMR, the student and teacher interact at the gradient level and transfer knowledge more effectively than in conventional KD. Evaluation shows that LCMR reduces word error rate (WER) by a relative 14.47%-33.24% compared to conventional KD.
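The abstract does not give the replacement schedule, so the following is only a minimal sketch of how a logarithmic curriculum over replacing rates might look. The function names, the `base_rate` parameter, and the exact `log1p` formula are assumptions made for illustration and are not taken from the paper. Modules are represented as plain strings so the example stays self-contained.

```python
import math
import random


def log_replacing_rate(step, total_steps, base_rate=0.1):
    """Hypothetical log-shaped curriculum for the replacing rate.

    Starts at base_rate, rises quickly early in training, then flattens
    and reaches 1.0 at total_steps. The formula is an assumption.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    progress = math.log1p(step) / math.log1p(total_steps)
    return base_rate + (1.0 - base_rate) * progress


def choose_modules(teacher_modules, student_modules, rate, rng):
    """Per slot, pick the student module with probability `rate`,
    otherwise keep the teacher module in the forward path."""
    return [
        student if rng.random() < rate else teacher
        for teacher, student in zip(teacher_modules, student_modules)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    teacher = ["T-encoder", "T-decoder", "T-joint"]  # placeholder names
    student = ["S-encoder", "S-decoder", "S-joint"]  # placeholder names
    for step in (0, 10, 100, 1000):
        rate = log_replacing_rate(step, total_steps=1000)
        print(step, round(rate, 3), choose_modules(teacher, student, rate, rng))
```

Because teacher and student modules are mixed in a single forward path, gradients reaching a student module pass through the neighbouring teacher modules, which is one way to read the abstract's statement that the two networks interact at the gradient level.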

Original language	English (US)
Pages (from-to)	4436-4440
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2022-September
DOIs	https://doi.org/10.21437/Interspeech.2022-500
State	Published - 2022
Event	23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, Incheon, Republic of Korea (Sep 18 2022 → Sep 22 2022)

Keywords

Knowledge Distillation
Model Compression
Module Replacing
RNN-T

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Cite this

Zhao, K., Nguyen, H. D., Jain, A., Susanj, N., Mouchtaris, A., Gupta, L., & Zhao, M. (2022). Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 4436-4440. https://doi.org/10.21437/Interspeech.2022-500

Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer. / Zhao, Kaiqi; Nguyen, Hieu Duy; Jain, Animesh et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2022-September, 2022, p. 4436-4440.

Zhao, K, Nguyen, HD, Jain, A, Susanj, N, Mouchtaris, A, Gupta, L & Zhao, M 2022, 'Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2022-September, pp. 4436-4440. https://doi.org/10.21437/Interspeech.2022-500

@article{277b8438b8e94d05bb7fe967a2360d83,

title = "Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer",

abstract = "Automatic Speech Recognition (ASR) is increasingly used by edge applications such as intelligent virtual assistants. However, state-of-the-art ASR models such as the Recurrent Neural Network Transducer (RNN-T) are computationally intensive for resource-constrained edge devices. Knowledge Distillation (KD) is a promising approach to compressing large models by using a large model (``teacher'') to train a small model (``student''). This paper proposes a novel KD method for RNN-T called Log-Curriculum based Module Replacing (LCMR). LCMR compresses RNN-T and addresses its unique characteristics by replacing teacher modules that contain multiple Long Short-Term Memory (LSTM)/Dense layers with substitute student modules that contain fewer LSTM/Dense layers. LCMR employs a novel nonlinear, Curriculum Learning driven replacement strategy that further improves performance by updating the replacing rates with a dynamic smoothing mechanism. Under LCMR, the student and teacher interact at the gradient level and transfer knowledge more effectively than in conventional KD. Evaluation shows that LCMR reduces word error rate (WER) by a relative 14.47\%--33.24\% compared to conventional KD.",

keywords = "Knowledge Distillation, Model Compression, Module Replacing, RNN-T",

author = "Kaiqi Zhao and Nguyen, {Hieu Duy} and Animesh Jain and Nathan Susanj and Athanasios Mouchtaris and Lokesh Gupta and Ming Zhao",

note = "Publisher Copyright: Copyright {\textcopyright} 2022 ISCA.; 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 ; Conference date: 18-09-2022 Through 22-09-2022",

year = "2022",

doi = "10.21437/Interspeech.2022-500",

language = "English (US)",

volume = "2022-September",

pages = "4436--4440",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer

AU - Zhao, Kaiqi

AU - Nguyen, Hieu Duy

AU - Jain, Animesh

AU - Susanj, Nathan

AU - Mouchtaris, Athanasios

AU - Gupta, Lokesh

AU - Zhao, Ming

PY - 2022

Y1 - 2022

N2 - Automatic Speech Recognition (ASR) is increasingly used by edge applications such as intelligent virtual assistants. However, state-of-the-art ASR models such as the Recurrent Neural Network Transducer (RNN-T) are computationally intensive for resource-constrained edge devices. Knowledge Distillation (KD) is a promising approach to compressing large models by using a large model ("teacher") to train a small model ("student"). This paper proposes a novel KD method for RNN-T called Log-Curriculum based Module Replacing (LCMR). LCMR compresses RNN-T and addresses its unique characteristics by replacing teacher modules that contain multiple Long Short-Term Memory (LSTM)/Dense layers with substitute student modules that contain fewer LSTM/Dense layers. LCMR employs a novel nonlinear, Curriculum Learning driven replacement strategy that further improves performance by updating the replacing rates with a dynamic smoothing mechanism. Under LCMR, the student and teacher interact at the gradient level and transfer knowledge more effectively than in conventional KD. Evaluation shows that LCMR reduces word error rate (WER) by a relative 14.47%-33.24% compared to conventional KD.

AB - Automatic Speech Recognition (ASR) is increasingly used by edge applications such as intelligent virtual assistants. However, state-of-the-art ASR models such as the Recurrent Neural Network Transducer (RNN-T) are computationally intensive for resource-constrained edge devices. Knowledge Distillation (KD) is a promising approach to compressing large models by using a large model ("teacher") to train a small model ("student"). This paper proposes a novel KD method for RNN-T called Log-Curriculum based Module Replacing (LCMR). LCMR compresses RNN-T and addresses its unique characteristics by replacing teacher modules that contain multiple Long Short-Term Memory (LSTM)/Dense layers with substitute student modules that contain fewer LSTM/Dense layers. LCMR employs a novel nonlinear, Curriculum Learning driven replacement strategy that further improves performance by updating the replacing rates with a dynamic smoothing mechanism. Under LCMR, the student and teacher interact at the gradient level and transfer knowledge more effectively than in conventional KD. Evaluation shows that LCMR reduces word error rate (WER) by a relative 14.47%-33.24% compared to conventional KD.

KW - Knowledge Distillation

KW - Model Compression

KW - Module Replacing

KW - RNN-T

UR - http://www.scopus.com/inward/record.url?scp=85140092095&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85140092095&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2022-500

DO - 10.21437/Interspeech.2022-500

M3 - Conference article

AN - SCOPUS:85140092095

SN - 2308-457X

VL - 2022-September

SP - 4436

EP - 4440

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022

Y2 - 18 September 2022 through 22 September 2022

ER -
