Abstract

Social media services have become a prominent source of research data for both academic and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site Twitter has garnered a particularly large following among researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Because Twitter maintains sole control of the sampling process, we have no knowledge of how the data that we collect for research is selected. Additionally, past research has shown that there are sources of bias present in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter limit track as a sample population estimator. We then provide methods to mitigate bias by improving sample population coverage using clustering techniques.

Original language: English (US)
Title of host publication: HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media
Publisher: Association for Computing Machinery, Inc
Pages: 237-245
Number of pages: 9
ISBN (Print): 9781450333955
DOIs: https://doi.org/10.1145/2700171.2791030
State: Published - Aug 24 2015
Event: 26th ACM Conference on Hypertext and Social Media, HT 2015 - Guzelyurt, Cyprus
Duration: Sep 1 2015 to Sep 4 2015

Keywords

  • Clustering
  • Social media
  • Text processing

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction

Cite this

Sampson, J., Morstatter, F., Maciejewski, R., & Liu, H. (2015). Surpassing the limit: Keyword clustering to improve Twitter sample coverage. In HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media (pp. 237-245). Association for Computing Machinery, Inc. https://doi.org/10.1145/2700171.2791030

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

@inproceedings{3c57730c6cb54c74acdfe720f7793b51,
title = "Surpassing the limit: Keyword clustering to improve {Twitter} sample coverage",
abstract = "Social media services have become a prominent source of research data for both academic and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site Twitter has garnered a particularly large following among researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Because Twitter maintains sole control of the sampling process, we have no knowledge of how the data that we collect for research is selected. Additionally, past research has shown that there are sources of bias present in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter limit track as a sample population estimator. We then provide methods to mitigate bias by improving sample population coverage using clustering techniques.",
keywords = "Clustering, Social media, Text processing",
author = "Justin Sampson and Fred Morstatter and Ross Maciejewski and Huan Liu",
year = "2015",
month = "8",
day = "24",
doi = "10.1145/2700171.2791030",
language = "English (US)",
isbn = "9781450333955",
pages = "237--245",
booktitle = "HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media",
publisher = "Association for Computing Machinery, Inc",

}

TY - GEN

T1 - Surpassing the limit

T2 - Keyword clustering to improve Twitter sample coverage

AU - Sampson, Justin

AU - Morstatter, Fred

AU - Maciejewski, Ross

AU - Liu, Huan

PY - 2015/8/24

Y1 - 2015/8/24

N2 - Social media services have become a prominent source of research data for both academic and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site Twitter has garnered a particularly large following among researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Because Twitter maintains sole control of the sampling process, we have no knowledge of how the data that we collect for research is selected. Additionally, past research has shown that there are sources of bias present in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter limit track as a sample population estimator. We then provide methods to mitigate bias by improving sample population coverage using clustering techniques.

KW - Clustering

KW - Social media

KW - Text processing

UR - http://www.scopus.com/inward/record.url?scp=84951875749&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84951875749&partnerID=8YFLogxK

U2 - 10.1145/2700171.2791030

DO - 10.1145/2700171.2791030

M3 - Conference contribution

SN - 9781450333955

SP - 237

EP - 245

BT - HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media

PB - Association for Computing Machinery, Inc

ER -