TY - GEN
T1 - Computationally-efficient voice activity detection based on deep neural networks
AU - Xiong, Yan
AU - Berisha, Visar
AU - Chakrabarti, Chaitali
N1 - Funding Information:
This paper was supported in part by a grant from AFRL and DARPA under agreement number FA8650-18-2-7864.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While several very low-power analog solutions exist, more recent deep neural network (DNN) based solutions achieve superior VAD performance even in complex noisy backgrounds, at the expense of increased computation. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high-performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model on the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model in accuracy with ≈55× lower computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it processes every 40 ms input audio sequence with a delay of 15.17 ms, achieving real-time performance.
AB - Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While several very low-power analog solutions exist, more recent deep neural network (DNN) based solutions achieve superior VAD performance even in complex noisy backgrounds, at the expense of increased computation. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high-performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model on the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model in accuracy with ≈55× lower computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it processes every 40 ms input audio sequence with a delay of 15.17 ms, achieving real-time performance.
KW - Capsule network
KW - Deep neural network
KW - Low-power architecture
KW - Voice activity detection
UR - http://www.scopus.com/inward/record.url?scp=85122852120&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122852120&partnerID=8YFLogxK
U2 - 10.1109/SiPS52927.2021.00020
DO - 10.1109/SiPS52927.2021.00020
M3 - Conference contribution
AN - SCOPUS:85122852120
T3 - IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation
SP - 64
EP - 69
BT - Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021
Y2 - 19 October 2021 through 21 October 2021
ER -