Surpassing the limit: Keyword clustering to improve Twitter sample coverage

Justin Sampson; Fred Morstatter; Ross Maciejewski; Huan Liu

doi:10.1145/2700171.2791030

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Scopus citations

Abstract

Social media services have become a prominent source of research data for both academic and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site Twitter has garnered a particularly large following among researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Because Twitter maintains sole control of the sampling process, we have no knowledge of how the data that we collect for research is selected. Additionally, past research has shown that there are sources of bias in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter limit track as a sample population estimator. We then provide methods to mitigate bias by improving sample population coverage using clustering techniques.

Original language	English (US)
Title of host publication	HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media
Publisher	Association for Computing Machinery, Inc
Pages	237-245
Number of pages	9
ISBN (Electronic)	9781450333955
DOIs	https://doi.org/10.1145/2700171.2791030
State	Published - Aug 24 2015
Event	26th ACM Conference on Hypertext and Social Media, HT 2015 - Guzelyurt, Cyprus; Duration: Sep 1 2015 → Sep 4 2015

Publication series

Name	HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media

Conference

Conference	26th ACM Conference on Hypertext and Social Media, HT 2015
Country/Territory	Cyprus
City	Guzelyurt
Period	9/1/15 → 9/4/15

Keywords

Clustering
Social media
Text processing

ASJC Scopus subject areas

Artificial Intelligence
Software
Computer Graphics and Computer-Aided Design
Human-Computer Interaction

Cite this

Sampson, J., Morstatter, F., Maciejewski, R., & Liu, H. (2015). Surpassing the limit: Keyword clustering to improve Twitter sample coverage. In HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media (pp. 237-245). (HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media). Association for Computing Machinery, Inc. https://doi.org/10.1145/2700171.2791030

@inproceedings{3c57730c6cb54c74acdfe720f7793b51,

title = "Surpassing the limit: Keyword clustering to improve Twitter sample coverage",

keywords = "Clustering, Social media, Text processing",

author = "Justin Sampson and Fred Morstatter and Ross Maciejewski and Huan Liu",

note = "Funding Information: This work is sponsored, in part, by Office of Naval Research grant N000141410095. Publisher Copyright: {\textcopyright} 2015 ACM.; 26th ACM Conference on Hypertext and Social Media, HT 2015 ; Conference date: 01-09-2015 Through 04-09-2015",

year = "2015",

month = aug,

day = "24",

doi = "10.1145/2700171.2791030",

language = "English (US)",

series = "HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media",

publisher = "Association for Computing Machinery, Inc",

pages = "237--245",

booktitle = "HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media",

}

