TY - GEN
T1 - A speech emotion recognition framework based on latent Dirichlet allocation
T2 - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
AU - Shah, Mohit
AU - Miao, Lifeng
AU - Chakrabarti, Chaitali
AU - Spanias, Andreas
PY - 2013/10/18
Y1 - 2013/10/18
AB - In this paper, we present a speech-based emotion recognition framework built on a latent Dirichlet allocation (LDA) model. The method assumes that incoming speech frames are conditionally independent and exchangeable. While this discards temporal structure, it captures significant statistical information across frames; in contrast, a hidden Markov model-based approach captures the temporal structure in speech. Evaluated on the German emotional speech database EMO-DB, the framework achieves an average classification accuracy of 80.7%, compared to 73% for hidden Markov models. This improvement comes at the cost of a slight increase in computational complexity. We map the proposed algorithm onto an FPGA platform and show that the emotions in a speech utterance of duration 1.5 s can be identified in 1.8 ms while utilizing 70% of the FPGA resources. This further demonstrates the suitability of our approach for real-time applications on hand-held devices.
KW - FPGA implementation
KW - affective computing
KW - emotion recognition
KW - latent Dirichlet allocation
UR - http://www.scopus.com/inward/record.url?scp=84890452441&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84890452441&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2013.6638116
DO - 10.1109/ICASSP.2013.6638116
M3 - Conference contribution
AN - SCOPUS:84890452441
SN - 9781479903566
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 2553
EP - 2557
BT - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
Y2 - 26 May 2013 through 31 May 2013
ER -