Abstract

Overlapped speech poses a significant problem in many speech processing applications, including speaker identification, speaker diarization, and speech recognition. To address it, existing systems combine source separation with algorithms for processing non-overlapped speech (e.g., source separation followed by speech recognition). In this paper, we propose a modified network architecture that simultaneously recognizes keywords from overlapped speech without explicitly performing source separation. We build our network by adding capsule layers to a ResNet architecture that has shown state-of-the-art performance on a traditional keyword recognition task. We evaluate the model on a series of 10-word overlapped keyword recognition experiments, using speaker-dependent and speaker-independent training. Results indicate that the Residual + Capsule (ResCap) network shows marked improvement in recognizing overlapped speech, especially in experiments where the number of overlapped speakers differs between the training set and the test set.
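The abstract does not describe the capsule layers in detail, so the following is only a minimal sketch of the general capsule idea that makes simultaneous recognition plausible: each keyword gets its own output capsule, the standard "squash" nonlinearity maps the capsule's length into [0, 1), and every keyword whose length exceeds a threshold is reported as present. The keyword names, the raw capsule values, the threshold of 0.5, and the function names are all assumptions made for illustration, not values or APIs from the paper.

```python
import math


def squash(vector):
    """Capsule 'squash' nonlinearity: keeps direction, maps length into [0, 1)."""
    sq_norm = sum(x * x for x in vector)
    if sq_norm == 0.0:
        return [0.0 for _ in vector]
    norm = math.sqrt(sq_norm)
    # Resulting length is sq_norm / (1 + sq_norm).
    scale = sq_norm / (1.0 + sq_norm) / norm
    return [scale * x for x in vector]


def capsule_length(vector):
    """Euclidean length of a capsule output vector."""
    return math.sqrt(sum(x * x for x in vector))


def detect_keywords(capsules, threshold=0.5):
    """Return every keyword whose squashed capsule length exceeds the threshold.

    Unlike a softmax over classes, several keywords can be active at once,
    which is what overlapped speech requires.
    """
    return sorted(
        word
        for word, vec in capsules.items()
        if capsule_length(squash(vec)) > threshold
    )


# Hypothetical raw capsule outputs for three keywords (made-up numbers).
raw = {
    "yes": [2.0, 1.0, 0.0],
    "no": [0.1, 0.0, 0.1],
    "stop": [0.0, 3.0, 1.0],
}
detected = detect_keywords(raw)
print(detected)
```

Because each capsule is thresholded independently, the number of detected keywords is not fixed in advance, which is one reason a readout of this kind can, in principle, cope with a different number of overlapped speakers at test time than in training.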

Original language: English (US)
Pages (from-to): 3337-3341
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2019-September
DOIs: 10.21437/Interspeech.2019-2913
State: Published - Jan 1 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: Sep 15 2019 - Sep 19 2019

Keywords

  • Capsule networks
  • Keyword spotting
  • Overlapped speech
  • Recognition
  • Residual networks
  • ResNet
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

@article{fc77799c65a44a308adb83005d8dda69,
title = "Residual + capsule networks (RESCAP) for simultaneous single-channel overlapped keyword recognition",
abstract = "Overlapped speech poses a significant problem in many speech processing applications, including speaker identification, speaker diarization, and speech recognition. To address it, existing systems combine source separation with algorithms for processing non-overlapped speech (e.g., source separation followed by speech recognition). In this paper, we propose a modified network architecture that simultaneously recognizes keywords from overlapped speech without explicitly performing source separation. We build our network by adding capsule layers to a ResNet architecture that has shown state-of-the-art performance on a traditional keyword recognition task. We evaluate the model on a series of 10-word overlapped keyword recognition experiments, using speaker-dependent and speaker-independent training. Results indicate that the Residual + Capsule (ResCap) network shows marked improvement in recognizing overlapped speech, especially in experiments where the number of overlapped speakers differs between the training set and the test set.",
keywords = "Capsule networks, Keyword spotting, Overlapped speech, Recognition, Residual networks, ResNet, Speech recognition",
author = "Yan Xiong and Visar Berisha and Chaitali Chakrabarti",
year = "2019",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2019-2913",
language = "English (US)",
volume = "2019-September",
pages = "3337--3341",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR
T1 - Residual + capsule networks (RESCAP) for simultaneous single-channel overlapped keyword recognition
AU - Xiong, Yan
AU - Berisha, Visar
AU - Chakrabarti, Chaitali
PY - 2019/1/1
Y1 - 2019/1/1
AB - Overlapped speech poses a significant problem in many speech processing applications, including speaker identification, speaker diarization, and speech recognition. To address it, existing systems combine source separation with algorithms for processing non-overlapped speech (e.g., source separation followed by speech recognition). In this paper, we propose a modified network architecture that simultaneously recognizes keywords from overlapped speech without explicitly performing source separation. We build our network by adding capsule layers to a ResNet architecture that has shown state-of-the-art performance on a traditional keyword recognition task. We evaluate the model on a series of 10-word overlapped keyword recognition experiments, using speaker-dependent and speaker-independent training. Results indicate that the Residual + Capsule (ResCap) network shows marked improvement in recognizing overlapped speech, especially in experiments where the number of overlapped speakers differs between the training set and the test set.
KW - Capsule networks
KW - Keyword spotting
KW - Overlapped speech
KW - Recognition
KW - Residual networks
KW - ResNet
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85074733289&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074733289&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-2913
DO - 10.21437/Interspeech.2019-2913
M3 - Conference article
AN - SCOPUS:85074733289
VL - 2019-September
SP - 3337
EP - 3341
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -