Performance of forced-alignment algorithms on children’s speech

Tristan J. Mahr; Visar Berisha; Kan Kawabata; Julie Liss; Katherine C. Hustad

doi:10.1044/2020_JSLHR-20-00268

College of Health Solutions (CHS)

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers.

Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals.

Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy increased with age for fricative sounds across the aligners too.

Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors.

Supplemental Material: https://doi.org/10.23641/asha.14167058.
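The two evaluation measures named in the abstract (midpoint coverage and phone-onset difference) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the (start, end) tuple layout in seconds, the closed-interval comparison, the one-to-one pairing of automatic and manual phones, and the toy timings are all assumptions made for illustration.

```python
# Illustrative sketch only: names, data layout, and timings are assumptions,
# not taken from the article or its supplemental material.

def covers_midpoint(auto, manual):
    """Return True if the automatic (start, end) interval contains
    the midpoint of the manual (start, end) interval."""
    midpoint = (manual[0] + manual[1]) / 2.0
    return auto[0] <= midpoint <= auto[1]


def score_alignment(auto_intervals, manual_intervals):
    """Score automatic phone intervals against manual ones.

    Assumes both lists hold the same phones in the same order.
    """
    if not manual_intervals or len(auto_intervals) != len(manual_intervals):
        raise ValueError("expected two non-empty, equally long lists of intervals")
    pairs = list(zip(auto_intervals, manual_intervals))
    # Accuracy: share of phones whose automatic interval covers the manual midpoint.
    hits = sum(covers_midpoint(a, m) for a, m in pairs)
    # Onset difference: automatic start minus manual start, per phone.
    onset_diffs = [a[0] - m[0] for a, m in pairs]
    return {
        "accuracy": hits / len(pairs),
        "mean_abs_onset_diff": sum(abs(d) for d in onset_diffs) / len(pairs),
    }


# Toy example with made-up timings (seconds).
manual = [(0.00, 0.10), (0.10, 0.30), (0.30, 0.50)]
auto = [(0.00, 0.12), (0.12, 0.18), (0.33, 0.50)]
scores = score_alignment(auto, manual)
print(scores)
```

In this toy example the second automatic interval ends before the manual midpoint at 0.20 s, so it counts as a miss even though its onset is only 0.02 s late. The two measures can therefore disagree for the same phone.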

Original language	English (US)
Pages (from-to)	2213-2222
Number of pages	10
Journal	Journal of Speech, Language, and Hearing Research
Volume	64
Issue number	6s
DOIs	https://doi.org/10.1044/2020_JSLHR-20-00268
State	Published - Jun 2021

ASJC Scopus subject areas

Language and Linguistics
Linguistics and Language
Speech and Hearing

Access to Document

10.1044/2020_JSLHR-20-00268

Cite this

@article{227521e41b9949cab5ee440f0cf619d0,

title = "Performance of forced-alignment algorithms on children{\textquoteright}s speech",

abstract = "Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers. Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals. Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy increased with age for fricative sounds across the aligners too. Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors. Supplemental Material: https://doi.org/10.23641/asha.14167058.",

author = "Mahr, {Tristan J.} and Visar Berisha and Kan Kawabata and Julie Liss and Hustad, {Katherine C.}",

note = "Publisher Copyright: {\textcopyright} 2021 American Speech-Language-Hearing Association.",

year = "2021",

month = jun,

doi = "10.1044/2020_JSLHR-20-00268",

language = "English (US)",

volume = "64",

pages = "2213--2222",

journal = "Journal of Speech, Language, and Hearing Research",

issn = "1092-4388",

publisher = "American Speech-Language-Hearing Association (ASHA)",

number = "6s",

}

TY - JOUR

T1 - Performance of forced-alignment algorithms on children’s speech

AU - Mahr, Tristan J.

AU - Berisha, Visar

AU - Kawabata, Kan

AU - Liss, Julie

AU - Hustad, Katherine C.

PY - 2021/6

Y1 - 2021/6

N2 - Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers. Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals. Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy increased with age for fricative sounds across the aligners too. Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors. Supplemental Material: https://doi.org/10.23641/asha.14167058.

UR - http://www.scopus.com/inward/record.url?scp=85108741792&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85108741792&partnerID=8YFLogxK

U2 - 10.1044/2020_JSLHR-20-00268

DO - 10.1044/2020_JSLHR-20-00268

M3 - Article

C2 - 33705675

AN - SCOPUS:85108741792

SN - 1092-4388

VL - 64

SP - 2213

EP - 2222

JO - Journal of Speech, Language, and Hearing Research

JF - Journal of Speech, Language, and Hearing Research

IS - 6s

ER -

Datasets

Forced alignment of child speech (Mahr et al., 2021)
