Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features

Yishan Jiao; Ming Tu; Visar Berisha; Julie Liss

doi:10.21437/Interspeech.2016-1148

Research output: Contribution to journal › Conference article › peer-review

44 Scopus citations

Abstract

Automatic identification of foreign accents is valuable for many speech systems, such as speech recognition, speaker identification, voice conversion, etc. The INTERSPEECH 2016 Native Language Sub-Challenge is to identify the native languages of non-native English speakers from eleven countries. Since differences in accent are due to both prosodic and articulation characteristics, a combination of long-term and short-term training is proposed in this paper. Each speech sample is processed into multiple speech segments with equal length. For each segment, deep neural networks (DNNs) are used to train on long-term statistical features, while recurrent neural networks (RNNs) are used to train on short-term acoustic features. The result for each speech sample is calculated by linearly fusing the results from the two sets of networks on all segments. The performance of the proposed system greatly surpasses the provided baseline system. Moreover, by fusing the results with the baseline system, the performance can be further improved.
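The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract states only that results from the two sets of networks are linearly fused over all segments; the function name `fuse_segment_scores`, the single weight `alpha`, the averaging over segments, and the toy numbers below are assumptions made for illustration, not details taken from the paper.

```python
def fuse_segment_scores(dnn_probs, rnn_probs, alpha=0.5):
    """Linearly fuse per-segment class probabilities from two models.

    dnn_probs, rnn_probs: lists of per-segment probability lists,
        each of shape (n_segments, n_classes).
    alpha: hypothetical weight on the DNN scores (assumed, not from the paper).
    Returns (predicted_class_index, fused_probabilities).
    """
    if not dnn_probs or len(dnn_probs) != len(rnn_probs):
        raise ValueError("both models must score the same non-empty set of segments")
    n_segments = len(dnn_probs)
    n_classes = len(dnn_probs[0])
    fused = []
    for c in range(n_classes):
        # Average each model's score over segments (an assumed pooling rule),
        # then mix the two averages with a single linear weight.
        dnn_mean = sum(seg[c] for seg in dnn_probs) / n_segments
        rnn_mean = sum(seg[c] for seg in rnn_probs) / n_segments
        fused.append(alpha * dnn_mean + (1.0 - alpha) * rnn_mean)
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Toy example: 3 segments, 3 candidate native languages (made-up numbers).
dnn = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.4, 0.4, 0.2]]
rnn = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1], [0.1, 0.8, 0.1]]
label, fused = fuse_segment_scores(dnn, rnn, alpha=0.5)
```

In practice the weight would presumably be tuned on held-out data, and fusing with a third system such as the challenge baseline would add another weighted term of the same form.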

Original language	English (US)
Pages (from-to)	2388-2392
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	08-12-September-2016
DOIs	https://doi.org/10.21437/Interspeech.2016-1148
State	Published - 2016
Event	17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration	Sep 8 2016 → Sep 16 2016

Keywords

Accent identification
Articulation
Deep neural networks
Prosody

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Cite this

Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features. / Jiao, Yishan; Tu, Ming; Berisha, Visar et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 08-12-September-2016, 2016, p. 2388-2392.

@article{9492d3638aae4215a6f3f62a45acdd04,

title = "Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features",

abstract = "Automatic identification of foreign accents is valuable for many speech systems, such as speech recognition, speaker identification, voice conversion, etc. The INTERSPEECH 2016 Native Language Sub-Challenge is to identify the native languages of non-native English speakers from eleven countries. Since differences in accent are due to both prosodic and articulation characteristics, a combination of long-term and short-term training is proposed in this paper. Each speech sample is processed into multiple speech segments with equal length. For each segment, deep neural networks (DNNs) are used to train on long-term statistical features, while recurrent neural networks (RNNs) are used to train on short-term acoustic features. The result for each speech sample is calculated by linearly fusing the results from the two sets of networks on all segments. The performance of the proposed system greatly surpasses the provided baseline system. Moreover, by fusing the results with the baseline system, the performance can be further improved.",

keywords = "Accent identification, Articulation, Deep neural networks, Prosody",

author = "Yishan Jiao and Ming Tu and Visar Berisha and Julie Liss",

note = "Funding Information: This work was partially supported by an NIH 1R21DC013812 grant. The authors graciously acknowledge a hardware donation from NVIDIA. Publisher Copyright: Copyright {\textcopyright} 2016 ISCA. 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016; Conference date: 08-09-2016 through 16-09-2016",

year = "2016",

doi = "10.21437/Interspeech.2016-1148",

language = "English (US)",

volume = "08-12-September-2016",

pages = "2388--2392",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features

AU - Jiao, Yishan

AU - Tu, Ming

AU - Berisha, Visar

AU - Liss, Julie

PY - 2016

Y1 - 2016

AB - Automatic identification of foreign accents is valuable for many speech systems, such as speech recognition, speaker identification, voice conversion, etc. The INTERSPEECH 2016 Native Language Sub-Challenge is to identify the native languages of non-native English speakers from eleven countries. Since differences in accent are due to both prosodic and articulation characteristics, a combination of long-term and short-term training is proposed in this paper. Each speech sample is processed into multiple speech segments with equal length. For each segment, deep neural networks (DNNs) are used to train on long-term statistical features, while recurrent neural networks (RNNs) are used to train on short-term acoustic features. The result for each speech sample is calculated by linearly fusing the results from the two sets of networks on all segments. The performance of the proposed system greatly surpasses the provided baseline system. Moreover, by fusing the results with the baseline system, the performance can be further improved.

KW - Accent identification

KW - Articulation

KW - Deep neural networks

KW - Prosody

UR - http://www.scopus.com/inward/record.url?scp=84994246161&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84994246161&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2016-1148

DO - 10.21437/Interspeech.2016-1148

M3 - Conference article

AN - SCOPUS:84994246161

SN - 2308-457X

VL - 08-12-September-2016

SP - 2388

EP - 2392

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016

Y2 - 8 September 2016 through 16 September 2016

ER -
