Computationally-efficient voice activity detection based on deep neural networks

Yan Xiong; Visar Berisha; Chaitali Chakrabarti

doi:10.1109/SiPS52927.2021.00020

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While there are several very low-power analog solutions, the more recent deep neural network (DNN) based solutions have superior VAD performance even in complex noisy backgrounds, at the expense of increased computation. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high-performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model using the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model on accuracy with ≈55× lower computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it can process every 40 ms input audio sequence with a delay of 15.17 ms, resulting in real-time performance.
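
The real-time claim in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the reported 15.17 ms delay is the processing time for one 40 ms input window, and the function name `real_time_factor` is hypothetical, not taken from the paper. A ratio below 1.0 means each window is processed before the next one arrives.

```python
def real_time_factor(processing_ms: float, window_ms: float) -> float:
    """Ratio of processing time to audio duration; below 1.0 means real time."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return processing_ms / window_ms


# Figures from the abstract; treating the delay as per-window
# processing time is an assumption of this sketch.
WINDOW_MS = 40.0
DELAY_MS = 15.17

rtf = real_time_factor(DELAY_MS, WINDOW_MS)
headroom_ms = WINDOW_MS - DELAY_MS  # time left before the next window arrives

print(f"real-time factor = {rtf:.3f}, headroom = {headroom_ms:.2f} ms")
```

Under that assumption the real-time factor is about 0.38, leaving roughly 24.8 ms of slack per window.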

Original language	English (US)
Title of host publication	Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	64-69
Number of pages	6
ISBN (Electronic)	9781665401449
DOIs	https://doi.org/10.1109/SiPS52927.2021.00020
State	Published - 2021
Event	2021 IEEE Workshop on Signal Processing Systems, SiPS 2021 - Coimbra, Portugal Duration: Oct 19 2021 → Oct 21 2021

Publication series

Name	IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation
Volume	2021-October
ISSN (Print)	1520-6130

Conference

Conference	2021 IEEE Workshop on Signal Processing Systems, SiPS 2021
Country/Territory	Portugal
City	Coimbra
Period	10/19/21 → 10/21/21

Keywords

Capsule network
Deep neural network
Low-power architecture
Voice activity detection

ASJC Scopus subject areas

Electrical and Electronic Engineering
Signal Processing
Applied Mathematics
Hardware and Architecture

Access to Document

10.1109/SiPS52927.2021.00020

Cite this

Xiong, Y., Berisha, V., & Chakrabarti, C. (2021). Computationally-efficient voice activity detection based on deep neural networks. In Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021 (pp. 64-69). (IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation; Vol. 2021-October). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SiPS52927.2021.00020

Computationally-efficient voice activity detection based on deep neural networks. / Xiong, Yan; Berisha, Visar; Chakrabarti, Chaitali.
Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021. Institute of Electrical and Electronics Engineers Inc., 2021. p. 64-69 (IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation; Vol. 2021-October).

Xiong, Y, Berisha, V & Chakrabarti, C 2021, Computationally-efficient voice activity detection based on deep neural networks. in Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021. IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation, vol. 2021-October, Institute of Electrical and Electronics Engineers Inc., pp. 64-69, 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021, Coimbra, Portugal, 10/19/21. https://doi.org/10.1109/SiPS52927.2021.00020

Xiong Y, Berisha V, Chakrabarti C. Computationally-efficient voice activity detection based on deep neural networks. In Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021. Institute of Electrical and Electronics Engineers Inc. 2021. p. 64-69. (IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation). doi: 10.1109/SiPS52927.2021.00020

@inproceedings{8dc2a94088e1438ba53945251e418e2b,

title = "Computationally-efficient voice activity detection based on deep neural networks",

abstract = "Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While there are several very low-power analog solutions, the more recent deep neural network (DNN) based solutions have superior VAD performance in even complex noisy backgrounds at the expense of increase in computations. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model using the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model on accuracy with ≈55× less computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it can process every 40ms input audio sequence with a delay of 15.17ms resulting in real-time performance. ",

keywords = "Capsule network, Deep neural network, Low-power architecture, Voice activity detection",

author = "Yan Xiong and Visar Berisha and Chaitali Chakrabarti",

note = "Funding Information: This paper was supported in part by a grant from AFRL and DARPA under agreement number FA8650-18-2-7864. Publisher Copyright: {\textcopyright} 2021 IEEE.; 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021 ; Conference date: 19-10-2021 Through 21-10-2021",

year = "2021",

doi = "10.1109/SiPS52927.2021.00020",

language = "English (US)",

series = "IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "64--69",

booktitle = "Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021",

}

TY - GEN

T1 - Computationally-efficient voice activity detection based on deep neural networks

AU - Xiong, Yan

AU - Berisha, Visar

AU - Chakrabarti, Chaitali

PY - 2021

Y1 - 2021

AB - Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While there are several very low-power analog solutions, the more recent deep neural network (DNN) based solutions have superior VAD performance in even complex noisy backgrounds at the expense of increase in computations. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model using the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model on accuracy with ≈55× less computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it can process every 40ms input audio sequence with a delay of 15.17ms resulting in real-time performance.

KW - Capsule network

KW - Deep neural network

KW - Low-power architecture

KW - Voice activity detection

UR - http://www.scopus.com/inward/record.url?scp=85122852120&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85122852120&partnerID=8YFLogxK

U2 - 10.1109/SiPS52927.2021.00020

DO - 10.1109/SiPS52927.2021.00020

M3 - Conference contribution

AN - SCOPUS:85122852120

T3 - IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation

SP - 64

EP - 69

BT - Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021

Y2 - 19 October 2021 through 21 October 2021

ER -
