Overlapped speech poses a significant problem in a variety of applications in speech processing including speaker identification, speaker diarization, and speech recognition among others. To address it, existing systems combine source separation with algorithms for processing non-overlapped speech (e.g. source separation + follow-on speech recognition). In this paper we propose a modified network architecture to simultaneously recognize keywords from overlapped speech without explicitly having to perform source separation. We build our network by adding capsule layers to a ResNet architecture that has shown state-of-the-art performance on a traditional keyword recognition task. We evaluate the model on a series of 10-word overlapped keyword recognition experiments, using speaker dependent and speaker independent training. Results indicate that Residual + Capsule (ResCap) network shows marked improvement in recognizing overlapped speech, especially in experiments where there is a mismatch in the number of overlapped speakers between the training set and the test set.

Original languageEnglish (US)
Pages (from-to)3337-3341
Number of pages5
JournalProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
StatePublished - Jan 1 2019
Event20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: Sep 15 2019Sep 19 2019


  • Capsule networks
  • Keyword spotting
  • Overlapped speech
  • Recognition
  • Residual networks
  • ResNet
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Fingerprint Dive into the research topics of 'Residual + capsule networks (RESCAP) for simultaneous single-channel overlapped keyword recognition'. Together they form a unique fingerprint.

Cite this