TY - JOUR
T1 - Automatic detection of learner's affect from conversational cues
AU - D'Mello, Sidney K.
AU - Craig, Scotty D.
AU - Witherspoon, Amy
AU - McDaniel, Bethany
AU - Graesser, Arthur
N1 - Funding Information:
Acknowledgements: We thank our research colleagues in the Emotive Computing Group and the Tutoring Research Group (TRG) at the University of Memphis (http://emotion.autotutor.org). Special thanks to Barry Gholson, Jeremiah Sullins, Patrick Chipman, Max Louwerse, Kristy Tapp, Brandon King, and Stan Franklin for their valuable contributions to this study. We gratefully acknowledge our partners at the Affective Computing Research Group at MIT, including Rosalind Picard, Ashish Kapoor, Barry Kort, and Robert Reilly. The authors would also like to thank the three anonymous reviewers and the guest editors Sandra Carberry and Fiorella de Rosis for their insightful reviews that significantly improved this paper. This research was supported by the National Science Foundation (REC 0106965, ITR 0325428, and REC 0633918) and the DoD Multidisciplinary University Research Initiative administered by ONR under grant N00014-00-1-0600. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of NSF, DoD, or ONR.
PY - 2008/2
Y1 - 2008/2
AB - We explored the reliability of detecting a learner's affect from conversational features extracted from interactions with AutoTutor, an intelligent tutoring system (ITS) that helps students learn by holding a conversation in natural language. Training data were collected in a learning session with AutoTutor, after which the affective states of the learner were rated by the learner, a peer, and two trained judges. Inter-rater reliability scores indicated that the classifications of the trained judges were more reliable than those of the novice judges. Seven data sets that temporally integrated the affective judgments with the dialogue features of each learner were constructed. The first four data sets corresponded to the judgments of the learner, the peer, and the two trained judges, while the remaining three data sets combined the judgments of two or more raters. Multiple regression analyses confirmed the hypothesis that dialogue features could significantly predict the affective states of boredom, confusion, flow, and frustration. Machine learning experiments indicated that standard classifiers were moderately successful in discriminating the affective states of boredom, confusion, flow, frustration, and neutral, yielding a peak accuracy of 42% with neutral (chance = 20%) and 54% without neutral (chance = 25%). Individual detections of boredom, confusion, flow, and frustration, when contrasted with neutral affect, had maximum accuracies of 69%, 68%, 71%, and 78%, respectively (chance = 50%). The classifiers that operated on the emotion judgments of the trained judges and the combined models outperformed those based on the judgments of the novices (i.e., the self and peer). Follow-up classification analyses that assessed the degree to which machine-generated affect labels correlated with affect judgments provided by humans revealed that human-machine agreement was on par with that of the novice judges (self and peer) but quantitatively lower than that of the trained judges. We discuss the prospects of extending AutoTutor into an affect-sensing ITS.
KW - Affect detection
KW - AutoTutor
KW - Conversational cues
KW - Dialogue features
KW - Discourse markers
KW - Human-computer dialogue
KW - Human-computer interaction
KW - Intelligent Tutoring Systems
UR - http://www.scopus.com/inward/record.url?scp=38749099104&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38749099104&partnerID=8YFLogxK
U2 - 10.1007/s11257-007-9037-6
DO - 10.1007/s11257-007-9037-6
M3 - Article
AN - SCOPUS:38749099104
VL - 18
SP - 45
EP - 80
JO - User Modeling and User-Adapted Interaction
JF - User Modeling and User-Adapted Interaction
SN - 0924-1868
IS - 1-2
ER -