Automatic detection of learner's affect from conversational cues

Sidney K. D'Mello, Scotty Craig, Amy Johnson, Bethany McDaniel, Arthur Graesser

Research output: Contribution to journal › Article

190 Citations (Scopus)

Abstract

We explored the reliability of detecting a learner's affect from conversational features extracted from interactions with AutoTutor, an intelligent tutoring system (ITS) that helps students learn by holding a conversation in natural language. Training data were collected in a learning session with AutoTutor, after which the affective states of the learner were rated by the learner, a peer, and two trained judges. Inter-rater reliability scores indicated that the classifications of the trained judges were more reliable than the novice judges. Seven data sets that temporally integrated the affective judgments with the dialogue features of each learner were constructed. The first four datasets corresponded to the judgments of the learner, a peer, and two trained judges, while the remaining three data sets combined judgments of two or more raters. Multiple regression analyses confirmed the hypothesis that dialogue features could significantly predict the affective states of boredom, confusion, flow, and frustration. Machine learning experiments indicated that standard classifiers were moderately successful in discriminating the affective states of boredom, confusion, flow, frustration, and neutral, yielding a peak accuracy of 42% with neutral (chance = 20%) and 54% without neutral (chance = 25%). Individual detections of boredom, confusion, flow, and frustration, when contrasted with neutral affect, had maximum accuracies of 69, 68, 71, and 78%, respectively (chance = 50%). The classifiers that operated on the emotion judgments of the trained judges and combined models outperformed those based on judgments of the novices (i.e., the self and peer). Follow-up classification analyses that assessed the degree to which machine-generated affect labels correlated with affect judgments provided by humans revealed that human-machine agreement was on par with novice judges (self and peer) but quantitatively lower than trained judges. 
We discuss the prospects of extending AutoTutor into an affect-sensing ITS.
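The accuracies reported above are judged against chance floors that depend on how many affect categories the classifier must discriminate. A minimal sketch of that arithmetic (assuming, as the abstract implies, uniformly likely labels):

```python
# Sketch of the chance baselines quoted in the abstract, assuming each
# affect label is equally likely: a random guesser over k labels is
# correct 1/k of the time.

def chance_baseline(labels):
    """Chance accuracy for uniform random guessing over distinct labels."""
    return 1.0 / len(set(labels))

five_way = ["boredom", "confusion", "flow", "frustration", "neutral"]
four_way = ["boredom", "confusion", "flow", "frustration"]
binary = ["boredom", "neutral"]  # any single state vs. neutral

print(chance_baseline(five_way))  # 0.2  -> the 20% floor (with neutral)
print(chance_baseline(four_way))  # 0.25 -> the 25% floor (without neutral)
print(chance_baseline(binary))    # 0.5  -> the 50% floor (one vs. neutral)
```

So the peak accuracies of 42% (5-way), 54% (4-way), and 69-78% (binary) each exceed their respective chance floors.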

Original language: English (US)
Pages (from-to): 45-80
Number of pages: 36
Journal: User Modeling and User-Adapted Interaction
Volume: 18
Issue number: 1-2
DOI: 10.1007/s11257-007-9037-6
State: Published - Feb 2008
Externally published: Yes

Keywords

  • Affect detection
  • AutoTutor
  • Conversational cues
  • Dialogue features
  • Discourse markers
  • Human-computer dialogue
  • Human-computer interaction
  • Intelligent Tutoring Systems

ASJC Scopus subject areas

  • Human-Computer Interaction

Cite this

Automatic detection of learner's affect from conversational cues. / D'Mello, Sidney K.; Craig, Scotty; Johnson, Amy; McDaniel, Bethany; Graesser, Arthur.

In: User Modeling and User-Adapted Interaction, Vol. 18, No. 1-2, 02.2008, p. 45-80.

Research output: Contribution to journal › Article

D'Mello, Sidney K. ; Craig, Scotty ; Johnson, Amy ; McDaniel, Bethany ; Graesser, Arthur. / Automatic detection of learner's affect from conversational cues. In: User Modeling and User-Adapted Interaction. 2008 ; Vol. 18, No. 1-2. pp. 45-80.
@article{fc709a0bb77a419093eec80f57733fcf,
title = "Automatic detection of learner's affect from conversational cues",
abstract = "We explored the reliability of detecting a learner's affect from conversational features extracted from interactions with AutoTutor, an intelligent tutoring system (ITS) that helps students learn by holding a conversation in natural language. Training data were collected in a learning session with AutoTutor, after which the affective states of the learner were rated by the learner, a peer, and two trained judges. Inter-rater reliability scores indicated that the classifications of the trained judges were more reliable than the novice judges. Seven data sets that temporally integrated the affective judgments with the dialogue features of each learner were constructed. The first four datasets corresponded to the judgments of the learner, a peer, and two trained judges, while the remaining three data sets combined judgments of two or more raters. Multiple regression analyses confirmed the hypothesis that dialogue features could significantly predict the affective states of boredom, confusion, flow, and frustration. Machine learning experiments indicated that standard classifiers were moderately successful in discriminating the affective states of boredom, confusion, flow, frustration, and neutral, yielding a peak accuracy of 42{\%} with neutral (chance = 20{\%}) and 54{\%} without neutral (chance = 25{\%}). Individual detections of boredom, confusion, flow, and frustration, when contrasted with neutral affect, had maximum accuracies of 69, 68, 71, and 78{\%}, respectively (chance = 50{\%}). The classifiers that operated on the emotion judgments of the trained judges and combined models outperformed those based on judgments of the novices (i.e., the self and peer). Follow-up classification analyses that assessed the degree to which machine-generated affect labels correlated with affect judgments provided by humans revealed that human-machine agreement was on par with novice judges (self and peer) but quantitatively lower than trained judges. 
We discuss the prospects of extending AutoTutor into an affect-sensing ITS.",
keywords = "Affect detection, AutoTutor, Conversational cues, Dialogue features, Discourse markers, Human-computer dialogue, Human-computer interaction, Intelligent Tutoring Systems",
author = "D'Mello, {Sidney K.} and Scotty Craig and Amy Johnson and Bethany McDaniel and Arthur Graesser",
year = "2008",
month = "2",
doi = "10.1007/s11257-007-9037-6",
language = "English (US)",
volume = "18",
pages = "45--80",
journal = "User Modeling and User-Adapted Interaction",
issn = "0924-1868",
publisher = "Springer Netherlands",
number = "1-2",

}

TY - JOUR

T1 - Automatic detection of learner's affect from conversational cues

AU - D'Mello, Sidney K.

AU - Craig, Scotty

AU - Johnson, Amy

AU - McDaniel, Bethany

AU - Graesser, Arthur

PY - 2008/2

Y1 - 2008/2

N2 - We explored the reliability of detecting a learner's affect from conversational features extracted from interactions with AutoTutor, an intelligent tutoring system (ITS) that helps students learn by holding a conversation in natural language. Training data were collected in a learning session with AutoTutor, after which the affective states of the learner were rated by the learner, a peer, and two trained judges. Inter-rater reliability scores indicated that the classifications of the trained judges were more reliable than the novice judges. Seven data sets that temporally integrated the affective judgments with the dialogue features of each learner were constructed. The first four datasets corresponded to the judgments of the learner, a peer, and two trained judges, while the remaining three data sets combined judgments of two or more raters. Multiple regression analyses confirmed the hypothesis that dialogue features could significantly predict the affective states of boredom, confusion, flow, and frustration. Machine learning experiments indicated that standard classifiers were moderately successful in discriminating the affective states of boredom, confusion, flow, frustration, and neutral, yielding a peak accuracy of 42% with neutral (chance = 20%) and 54% without neutral (chance = 25%). Individual detections of boredom, confusion, flow, and frustration, when contrasted with neutral affect, had maximum accuracies of 69, 68, 71, and 78%, respectively (chance = 50%). The classifiers that operated on the emotion judgments of the trained judges and combined models outperformed those based on judgments of the novices (i.e., the self and peer). Follow-up classification analyses that assessed the degree to which machine-generated affect labels correlated with affect judgments provided by humans revealed that human-machine agreement was on par with novice judges (self and peer) but quantitatively lower than trained judges. 
We discuss the prospects of extending AutoTutor into an affect-sensing ITS.

AB - We explored the reliability of detecting a learner's affect from conversational features extracted from interactions with AutoTutor, an intelligent tutoring system (ITS) that helps students learn by holding a conversation in natural language. Training data were collected in a learning session with AutoTutor, after which the affective states of the learner were rated by the learner, a peer, and two trained judges. Inter-rater reliability scores indicated that the classifications of the trained judges were more reliable than the novice judges. Seven data sets that temporally integrated the affective judgments with the dialogue features of each learner were constructed. The first four datasets corresponded to the judgments of the learner, a peer, and two trained judges, while the remaining three data sets combined judgments of two or more raters. Multiple regression analyses confirmed the hypothesis that dialogue features could significantly predict the affective states of boredom, confusion, flow, and frustration. Machine learning experiments indicated that standard classifiers were moderately successful in discriminating the affective states of boredom, confusion, flow, frustration, and neutral, yielding a peak accuracy of 42% with neutral (chance = 20%) and 54% without neutral (chance = 25%). Individual detections of boredom, confusion, flow, and frustration, when contrasted with neutral affect, had maximum accuracies of 69, 68, 71, and 78%, respectively (chance = 50%). The classifiers that operated on the emotion judgments of the trained judges and combined models outperformed those based on judgments of the novices (i.e., the self and peer). Follow-up classification analyses that assessed the degree to which machine-generated affect labels correlated with affect judgments provided by humans revealed that human-machine agreement was on par with novice judges (self and peer) but quantitatively lower than trained judges. 
We discuss the prospects of extending AutoTutor into an affect-sensing ITS.

KW - Affect detection

KW - AutoTutor

KW - Conversational cues

KW - Dialogue features

KW - Discourse markers

KW - Human-computer dialogue

KW - Human-computer interaction

KW - Intelligent Tutoring Systems

UR - http://www.scopus.com/inward/record.url?scp=38749099104&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=38749099104&partnerID=8YFLogxK

U2 - 10.1007/s11257-007-9037-6

DO - 10.1007/s11257-007-9037-6

M3 - Article

VL - 18

SP - 45

EP - 80

JO - User Modeling and User-Adapted Interaction

JF - User Modeling and User-Adapted Interaction

SN - 0924-1868

IS - 1-2

ER -