Social media services have become a prominent source of research data for both academic and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site Twitter has garnered a particularly large following among researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Because Twitter maintains sole control of the sampling process, we have no knowledge of how the data we collect for research is selected. Additionally, past research has shown that there are sources of bias in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter's limit track as a sample population estimator. We then provide methods to mitigate bias by improving sample population coverage using clustering techniques.

Original language: English (US)
Title of host publication: HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media
Publisher: Association for Computing Machinery, Inc
Number of pages: 9
ISBN (Print): 9781450333955
State: Published - Aug 24 2015
Event: 26th ACM Conference on Hypertext and Social Media, HT 2015 - Guzelyurt, Cyprus
Duration: Sep 1 2015 to Sep 4 2015




Keywords

  • Clustering
  • Social media
  • Text processing

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction

