Articulation constrained learning with application to speech emotion recognition

Research output: Contribution to journal › Article

Abstract

Speech emotion recognition methods that combine articulatory information with acoustic features have previously been shown to improve recognition performance. However, collecting articulatory data on a large scale may not be feasible in many scenarios, which restricts the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition that uses both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that force the model to reconstruct articulatory data, leading to sparse, interpretable representations jointly optimized for both tasks. Furthermore, the model requires articulatory features only during training; only acoustic features are needed for inference on out-of-sample data. Experiments are conducted to evaluate emotion recognition performance over the vowels /AA/, /AE/, /IY/, and /UW/, as well as over complete utterances. Incorporating articulatory information is shown to significantly improve performance for valence-based classification. Results for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
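
The page does not spell the objective out, but the abstract suggests a joint cost of roughly this form: a logistic loss on the acoustic features, an ℓ1 penalty on the classifier weights, and a term that rewards reconstructing the articulatory data from the sparsely weighted acoustic features. The sketch below illustrates one such reading with a plain proximal-gradient (ISTA-style) solver in NumPy; the coupling X·diag(w)·V, the penalty weights lam1/lam2, and the solver itself are assumptions made for illustration, not the formulation used in the paper.

# Illustrative sketch only (not the paper's exact formulation): an
# l1-regularized logistic regression whose sparse acoustic weights are also
# asked to support reconstruction of articulatory features, trained jointly
# by proximal gradient descent.
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_articulation_constrained(X, y, A, lam1=0.01, lam2=0.1,
                                 step=0.05, n_iter=500, seed=0):
    """X: (n, d) acoustic features; y: (n,) labels in {-1, +1};
    A: (n, m) articulatory features, used only during training."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                                   # sparse classifier weights
    V = 0.01 * rng.standard_normal((d, A.shape[1]))   # articulatory reconstruction map

    for _ in range(n_iter):
        # gradient of the average logistic loss with respect to w
        margins = y * (X @ w)
        g_logit = -(X.T @ (y * sigmoid(-margins))) / n

        # reconstruction residual: A is approximated by (X * w) @ V,
        # so only acoustic features kept by the sparse w contribute
        R = A - (X * w) @ V
        g_w_rec = -2.0 * lam2 * ((X.T @ R) * V).sum(axis=1) / n
        g_V = -2.0 * lam2 * (X * w).T @ R / n

        # proximal (soft-threshold) step enforces the l1 penalty on w
        w = soft_threshold(w - step * (g_logit + g_w_rec), step * lam1)
        V = V - step * g_V
    return w, V

def predict(X, w):
    """Inference needs acoustic features only."""
    return np.where(X @ w >= 0, 1, -1)

As in the abstract, the articulatory matrix A enters only the training routine; predict() uses acoustic features alone, so no articulatory data is needed for out-of-sample inference.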

Original language: English (US)
Article number: 14
Journal: EURASIP Journal on Audio, Speech, and Music Processing
Volume: 2019
Issue number: 1
DOI: 10.1186/s13636-019-0157-9
State: Published - Dec 1 2019

Keywords

  • Articulation
  • Constrained optimization
  • Cross-corpus
  • Emotion recognition

ASJC Scopus subject areas

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering

Cite this

@article{36e823b3967541708d32861218a12d5e,
title = "Articulation constrained learning with application to speech emotion recognition",
abstract = "Speech emotion recognition methods that combine articulatory information with acoustic features have previously been shown to improve recognition performance. However, collecting articulatory data on a large scale may not be feasible in many scenarios, which restricts the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition that uses both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that force the model to reconstruct articulatory data, leading to sparse, interpretable representations jointly optimized for both tasks. Furthermore, the model requires articulatory features only during training; only acoustic features are needed for inference on out-of-sample data. Experiments are conducted to evaluate emotion recognition performance over the vowels /AA/, /AE/, /IY/, and /UW/, as well as over complete utterances. Incorporating articulatory information is shown to significantly improve performance for valence-based classification. Results for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.",
keywords = "Articulation, Constrained optimization, Cross-corpus, Emotion recognition",
author = "Mohit Shah and Ming Tu and Visar Berisha and Chaitali Chakrabarti and Andreas Spanias",
year = "2019",
month = "12",
day = "1",
doi = "10.1186/s13636-019-0157-9",
language = "English (US)",
volume = "2019",
journal = "EURASIP Journal on Audio, Speech, and Music Processing",
issn = "1687-4714",
publisher = "Springer Publishing Company",
number = "1",

}
