Mitigating the Impact of Data Sampling on Social Media Analysis and Mining

Kuai Xu, Feng Wang, Haiyan Wang, Yufang Wang, Ying Zhang

Research output: Contribution to journal, Article, peer-review

6 Scopus citations


The last decade has witnessed the explosive growth of online social media in both users and content. Owing to the unprecedented scale and cascading power of the underlying social networks, social media has created a new paradigm for sharing information, broadcasting breaking news, and reporting real-time events by any user, from anywhere, at any time. Many popular social media sites, including Twitter, provide streaming data services through standard APIs to the broad research and developer communities. Given the sheer data volume, rapid velocity, and feature variety of online social media, these sites often supply only a sampled set of streaming data, rather than the full data set, to reduce the resource cost of computation, storage, and network bandwidth. In light of the substantial impact of sampling on the Twitter data stream, this article explores a combination of spectral clustering, locality-sensitive hashing (LSH), latent Dirichlet allocation (LDA) topic modeling, and differential equation modeling to mitigate the impact of sampling on social media data analysis, in particular on detecting real-world events and predicting information diffusion. Our extensive experiments demonstrate that the proposed method can effectively detect real-time emerging events and accurately predict the cascading patterns of these events from the 1% sampled Twitter data stream. To the best of our knowledge, this article is the first effort to introduce a systematic methodology to study and mitigate the impact of data sampling on social media analysis and mining.
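
To make the sampling problem in the abstract concrete, the following is a minimal sketch of what a 1% sample does to events of different sizes. It is not the article's method: it assumes simple independent (Bernoulli) sampling, which may differ from how Twitter actually samples its stream, and the stream contents, the topic labels, and the helper names `bernoulli_sample` and `estimate_counts` are hypothetical, chosen only for illustration.

```python
import random


def bernoulli_sample(stream, rate, seed=0):
    """Keep each item independently with probability `rate`.

    Assumption: independent per-item sampling; a real platform's
    sampling scheme may differ.
    """
    rng = random.Random(seed)
    return [item for item in stream if rng.random() < rate]


def estimate_counts(sample, rate):
    """Estimate full-stream counts per topic by scaling sample counts by 1/rate."""
    counts = {}
    for topic in sample:
        counts[topic] = counts.get(topic, 0) + 1
    return {topic: n / rate for topic, n in counts.items()}


# Hypothetical stream: one large event and one small event.
stream = ["big_event"] * 100_000 + ["small_event"] * 50

sample = bernoulli_sample(stream, rate=0.01)
estimates = estimate_counts(sample, rate=0.01)

print(len(sample))
print(estimates)
```

Under these assumptions the large event is estimated reasonably well after scaling, while the small event is expected to appear only 0.5 times in the sample, so it is usually either missing entirely or represented by a single item. This is the kind of distortion that motivates mitigation techniques for sampled streams.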

Original language: English (US)
Article number: 9001215
Pages (from-to): 546-555
Number of pages: 10
Journal: IEEE Transactions on Computational Social Systems
Issue number: 2
State: Published - Apr 2020


Keywords

  • Big data
  • Data sampling
  • Social media analysis

ASJC Scopus subject areas

  • Modeling and Simulation
  • Social Sciences (miscellaneous)
  • Human-Computer Interaction

