Abstract

Speaking rate estimation directly from the speech waveform is a long-standing problem in speech signal processing. In this paper, we pose the speaking rate estimation problem as that of estimating a temporal density function whose integral over a given interval yields the speaking rate within that interval. In contrast to many existing methods, we avoid the more difficult task of detecting individual phonemes within the speech signal and we avoid heuristics such as thresholding the temporal envelope to estimate the number of vowels. Rather, the proposed method aims to learn an optimal weighting function that can be directly applied to time-frequency features in a speech signal to yield a temporal density function. We propose two convex cost functions for learning the weighting functions and an adaptation strategy to customize the approach to a particular speaker using minimal training. The algorithms are evaluated on the TIMIT corpus, on a dysarthric speech corpus, and on the ICSI Switchboard spontaneous speech corpus. Results show that the proposed methods outperform three competing methods on both healthy and dysarthric speech. In addition, for spontaneous speech rate estimation, the results show a high correlation between the estimated speaking rate and ground truth values.
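
The density-function idea in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the two convex cost functions are not given on this page, so the sketch substitutes ridge-regularized least squares on per-utterance vowel counts as a stand-in convex objective. The function names (fit_weights, density, interval_count), the feature layout (frames by frequency bins), and the count labels are all assumptions made for illustration. The sketch also assumes that summing the density over an interval gives an estimated vowel count, which can be divided by the interval's duration to obtain a rate.

```python
import numpy as np


def fit_weights(features, counts, ridge=1e-3):
    """Fit a weight vector w so that sum_t (X[t] @ w) approximates each count.

    features: list of (T_i, D) arrays of time-frequency frames (hypothetical input).
    counts:   per-utterance vowel counts (hypothetical labels).
    Ridge-regularized least squares is a convex stand-in, not the paper's cost.
    """
    # The summed density is linear in w, so each utterance reduces to
    # its feature vector summed over time.
    S = np.stack([X.sum(axis=0) for X in features])  # shape (N, D)
    y = np.asarray(counts, dtype=float)
    D = S.shape[1]
    # Closed-form minimizer of ||S w - y||^2 + ridge * ||w||^2.
    return np.linalg.solve(S.T @ S + ridge * np.eye(D), S.T @ y)


def density(X, w):
    """Per-frame temporal density: a weighted sum of the features in each frame."""
    return X @ w


def interval_count(X, w, start, stop):
    """Discrete 'integral' of the density over frames start..stop-1."""
    return float(density(X, w)[start:stop].sum())
```

Because the density is summed frame by frame, the estimate for an interval equals the sum of the estimates for any sub-intervals that partition it, which is the property that lets a single learned weighting serve intervals of any length.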

Original language: English (US)
Article number: 7109110
Pages (from-to): 1421-1430
Number of pages: 10
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 23
Issue number: 9
DOI: 10.1109/TASLP.2015.2434213
State: Published - Sep 1 2015

Keywords

  • convex optimization
  • dysarthria
  • speaker adaptation
  • Speaking rate estimation
  • vowel density function

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics

Cite this

Convex Weighting Criteria for Speaking Rate Estimation. / Jiao, Yishan; Berisha, Visar; Tu, Ming; Liss, Julie.

In: IEEE Transactions on Audio, Speech and Language Processing, Vol. 23, No. 9, 7109110, 01.09.2015, p. 1421-1430.

Research output: Contribution to journal › Article

@article{cf5f92b609274c9eb9cf558fd41b45a0,
title = "Convex Weighting Criteria for Speaking Rate Estimation",
abstract = "Speaking rate estimation directly from the speech waveform is a long-standing problem in speech signal processing. In this paper, we pose the speaking rate estimation problem as that of estimating a temporal density function whose integral over a given interval yields the speaking rate within that interval. In contrast to many existing methods, we avoid the more difficult task of detecting individual phonemes within the speech signal and we avoid heuristics such as thresholding the temporal envelope to estimate the number of vowels. Rather, the proposed method aims to learn an optimal weighting function that can be directly applied to time-frequency features in a speech signal to yield a temporal density function. We propose two convex cost functions for learning the weighting functions and an adaptation strategy to customize the approach to a particular speaker using minimal training. The algorithms are evaluated on the TIMIT corpus, on a dysarthric speech corpus, and on the ICSI Switchboard spontaneous speech corpus. Results show that the proposed methods outperform three competing methods on both healthy and dysarthric speech. In addition, for spontaneous speech rate estimation, the results show a high correlation between the estimated speaking rate and ground truth values.",
keywords = "convex optimization, dysarthria, speaker adaptation, Speaking rate estimation, vowel density function",
author = "Yishan Jiao and Visar Berisha and Ming Tu and Julie Liss",
year = "2015",
month = "9",
day = "1",
doi = "10.1109/TASLP.2015.2434213",
language = "English (US)",
volume = "23",
pages = "1421--1430",
journal = "IEEE Transactions on Audio, Speech and Language Processing",
issn = "1558-7916",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "9",
}

TY - JOUR

T1 - Convex Weighting Criteria for Speaking Rate Estimation

AU - Jiao, Yishan

AU - Berisha, Visar

AU - Tu, Ming

AU - Liss, Julie

PY - 2015/9/1

Y1 - 2015/9/1

AB - Speaking rate estimation directly from the speech waveform is a long-standing problem in speech signal processing. In this paper, we pose the speaking rate estimation problem as that of estimating a temporal density function whose integral over a given interval yields the speaking rate within that interval. In contrast to many existing methods, we avoid the more difficult task of detecting individual phonemes within the speech signal and we avoid heuristics such as thresholding the temporal envelope to estimate the number of vowels. Rather, the proposed method aims to learn an optimal weighting function that can be directly applied to time-frequency features in a speech signal to yield a temporal density function. We propose two convex cost functions for learning the weighting functions and an adaptation strategy to customize the approach to a particular speaker using minimal training. The algorithms are evaluated on the TIMIT corpus, on a dysarthric speech corpus, and on the ICSI Switchboard spontaneous speech corpus. Results show that the proposed methods outperform three competing methods on both healthy and dysarthric speech. In addition, for spontaneous speech rate estimation, the results show a high correlation between the estimated speaking rate and ground truth values.

KW - convex optimization

KW - dysarthria

KW - speaker adaptation

KW - Speaking rate estimation

KW - vowel density function

UR - http://www.scopus.com/inward/record.url?scp=84933505427&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84933505427&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2015.2434213

DO - 10.1109/TASLP.2015.2434213

M3 - Article

VL - 23

SP - 1421

EP - 1430

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

SN - 1558-7916

IS - 9

M1 - 7109110

ER -