The Internet and social media have ushered in a new era that demands fundamental changes in computing. As massive amounts of data are continually produced, information overload is now a way of life. Noise, however, is pervasive in social media data, and noisy data can be detrimental to data analysis. Given the nature of social media data, most of it could be deemed useless and irrelevant gibberish. For example, over 90% of Twitter data could be considered noise in the traditional sense, since tweets that follow an original one usually add no new content or information. Removing such noise, however, may render the data useless for analysis. We face a dilemma: while it makes perfect sense to remove noise, its removal can make the data invalid. In this STIR project, we propose to understand the following unexplored issues: (a) the types of noise in social media and their distributions, (b) how noise behaves and its intricate relations with a variety of social media, and (c) how it affects the performance of data analysis. We also propose to develop novel computational algorithms for noise removal that take advantage of the properties of social media data and account for the strengths and weaknesses of noise removal and noise reduction (since removal can damage data validity), and to investigate under what circumstances noise can be tolerated in data analysis, given its multifunctional roles. We will perform both theoretical and empirical research using real-world social media data. The findings of this STIR project will significantly improve our understanding of both the salient and the subtle differences between traditional data and social media data, inform us of what new computational methods should be developed, and pave the way for large-scale research on human-centric computing with collective intelligence.
Effective start/end date: 8/15/13 → 4/14/14
- DOD-ARMY-ARL: Army Research Office (ARO): $50,000.00