Mitigating the Impact of Data Sampling on Social Media Analysis and Mining

Kuai Xu, Feng Wang, Haiyan Wang, Yufang Wang, Ying Zhang

Research output: Contribution to journal › Article

Abstract

The last decade has witnessed explosive growth of online social media in both users and content. Owing to the unprecedented scale and cascading power of the underlying social networks, social media has created a new paradigm for sharing information, broadcasting breaking news, and reporting real-time events by any user, from anywhere, at any time. Many popular social media sites, including Twitter, provide streaming data services through standard APIs to the broad researcher and developer communities. Given the sheer data volume, rapid velocity, and feature variety of online social media, these sites often supply only a sampled set of streaming data rather than the full data set, in order to reduce the resource cost of computation, storage, and network bandwidth. In light of the substantial impact of sampling on the Twitter data stream, this article explores a combination of spectral clustering, locality-sensitive hashing (LSH), latent Dirichlet allocation (LDA) topic modeling, and differential equation modeling to mitigate the impact of sampling on social media data analysis, in particular on detecting real-world events and predicting information diffusion. Our extensive experiments demonstrate that the proposed method can effectively detect real-time emerging events and accurately predict the cascading pattern of these events from the 1% sampled Twitter data stream. To the best of our knowledge, this article is the first effort to introduce a systematic methodology to study and mitigate the impact of data sampling on social media analysis and mining.

Original language: English (US)
Journal: IEEE Transactions on Computational Social Systems
State: Accepted/In press - Jan 1 2020

Keywords

  • Analytical models
  • Big data
  • Clustering algorithms
  • Data mining
  • Data sampling
  • Earthquakes
  • Real-time systems
  • Social media analysis
  • Twitter

ASJC Scopus subject areas

  • Modeling and Simulation
  • Social Sciences (miscellaneous)
  • Human-Computer Interaction
