TY - JOUR
T1 - Performance of forced-alignment algorithms on children’s speech
AU - Mahr, Tristan J.
AU - Berisha, Visar
AU - Kawabata, Kan
AU - Liss, Julie
AU - Hustad, Katherine C.
N1 - Funding Information:
This study was funded by Grants R01 DC015653 (awarded to Hustad) and R01 DC006859 (awarded to Liss/Berisha) from the National Institute on Deafness and Other Communication Disorders. Support was also provided by a core grant to the Waisman Center, U54 HD090256, from the National Institute of Child Health and Human Development. The authors thank the children and their families who participated in this research, and the students and staff at the University of Wisconsin–Madison and Arizona State University who assisted with data collection, data reduction, and analyses.
Publisher Copyright:
© 2021 American Speech-Language-Hearing Association.
PY - 2021/6
Y1 - 2021/6
N2 - Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time-consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers. Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals. Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and, across the aligners, alignment accuracy for fricatives increased with age. Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors. Supplemental Material: https://doi.org/10.23641/asha.14167058
UR - http://www.scopus.com/inward/record.url?scp=85108741792&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108741792&partnerID=8YFLogxK
U2 - 10.1044/2020_JSLHR-20-00268
DO - 10.1044/2020_JSLHR-20-00268
M3 - Article
C2 - 33705675
AN - SCOPUS:85108741792
VL - 64
SP - 2213
EP - 2222
JO - Journal of Speech, Language, and Hearing Research
JF - Journal of Speech, Language, and Hearing Research
SN - 1092-4388
IS - 6S
ER -