TY - GEN
T1 - Choosing the Right Words
T2 - 2nd Joint Conference on Lexical and Computational Semantics, SEM 2013
AU - Schwartz, H. Andrew
AU - Eichstaedt, Johannes
AU - Dziurzynski, Lukasz
AU - Blanco, Eduardo
AU - Kern, Margaret L.
AU - Ramones, Stephanie
AU - Seligman, Martin
AU - Ungar, Lyle
N1 - Funding Information:
Support for this research was provided by the Robert Wood Johnson Foundation’s Pioneer Portfolio, through a grant to Martin Seligman, “Exploring Concepts of Positive Health”. We thank the reviewers for their constructive and insightful comments.
Publisher Copyright:
© 2013 Association for Computational Linguistics.
PY - 2013
Y1 - 2013
N2 - Social scientists are increasingly using the vast amount of text available on social media to measure variation in happiness and other psychological states. Such studies count words deemed to be indicators of happiness and track how the word frequencies change across locations or time. This word count approach is simple and scalable, yet often picks up false signals, as words can appear in different contexts and take on different meanings. We characterize the types of errors that occur using the word count approach, and find lexical ambiguity to be the most prevalent. We then show that one can reduce error with a simple refinement to such lexica by automatically eliminating highly ambiguous words. The resulting refined lexica improve precision as measured by human judgments of word occurrences in Facebook posts.
AB - Social scientists are increasingly using the vast amount of text available on social media to measure variation in happiness and other psychological states. Such studies count words deemed to be indicators of happiness and track how the word frequencies change across locations or time. This word count approach is simple and scalable, yet often picks up false signals, as words can appear in different contexts and take on different meanings. We characterize the types of errors that occur using the word count approach, and find lexical ambiguity to be the most prevalent. We then show that one can reduce error with a simple refinement to such lexica by automatically eliminating highly ambiguous words. The resulting refined lexica improve precision as measured by human judgments of word occurrences in Facebook posts.
UR - http://www.scopus.com/inward/record.url?scp=85123688242&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123688242&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85123688242
T3 - SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity
SP - 296
EP - 305
BT - SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Proceedings of the Main Conference and the Shared Task
A2 - Diab, Mona
A2 - Baldwin, Tim
A2 - Baroni, Marco
PB - Association for Computational Linguistics (ACL)
Y2 - 13 June 2013 through 14 June 2013
ER -