Social media data is growing rapidly in size and variety as more and more people use platforms such as Facebook, Twitter, and LinkedIn. It is a massive treasure trove of interest to researchers in different disciplines (computer science, mathematics, and the social sciences) and a great source for data mining. However, although both are large-scale, social media data differs from the attribute-value data of classic data mining. Social media data is noisy and incomplete, comes from multiple sources, and forms multi-mode and multi-dimension networks. Its data points are not independent and identically distributed (i.i.d.) but are linked to one another. These unique properties present unprecedented challenges for mining social media data. Existing feature selection algorithms that have proven effective for data mining are not equipped for social media mining.
Intellectual Merit. We propose a new type of feature selection to facilitate the computational understanding of social media, investigating the associated fundamental research issues and developing new, effective algorithms in three areas:
Feature Selection for Linked Social Media Data. We show the differences between conventional data and social media data and illustrate the need for novel feature selection methods. We define the problem of feature selection with linked data and present a preliminary study that demonstrates the feasibility of developing new feature selection methods for social media and the benefit of using link information. We further propose to study (1) linked feature selection with instance selection, (2) linked feature selection considering tie strengths, and (3) linked feature selection based on social dimensions.
Feature Selection with Multiple Sources in Social Media. A prominent characteristic of social media is that its data can come from multiple sources. Given the nature of social media, data from each source can be noisy, partial, or redundant; hence, selecting relevant sources and using them together can make feature selection more effective. We propose to study feature selection for social media together with source selection and a related problem, view selection, since a source can often have more than one view.
Feature Selection with Multi-Mode and Multi-Dimension Social Media Networks. Unique complications of social media include its multi-mode and multi-dimension networks. As one's network expands and one's relationships become more complicated, the need to discern among them intensifies, because these complications can confuse feature selection algorithms. Our research will develop computational algorithms that handle multi-mode and multi-dimension social media networks efficiently.
The PI is currently funded by NSF grant IIS-1217466, Transforming Feature Selection to Harness the Power of Social Media. This supplement proposal seeks REU supplemental funding for two undergraduate students to serve the following three purposes: (1) providing a unique research opportunity for the participating undergraduate students by allowing them to work on a set of well-defined research and development tasks; (2) enhancing the above project by enriching and delivering a high-quality feature selection software package; and (3) involving undergraduate students in state-of-the-art research tasks so that they gain valuable research experience that will encourage them to pursue graduate study at ASU or other universities. The above project studies feature selection problems for social media data, including feature selection for linked data, feature selection with multiple sources, and feature selection with multi-dimension and multi-mode networks. One specific and challenging problem is suitable for undergraduate students: developing a feature selection software package, covering both traditional attribute-value data and social media data, that systematically integrates popular existing algorithms to facilitate their application, and performing an empirical comparative study. Although numerous feature selection algorithms have been published, there is no feature selection toolbox for social media data [3, 4]. The practical significance of this problem is three-fold: (1) facilitating the objective comparison of popular existing algorithms to study their characteristics and form guidelines for their application; (2) serving as a benchmarking platform for new algorithm development; and (3) assisting the application of feature selection to real problems, including problems in social media.
To ensure the success of the project, the PI has outlined a plan that covers not only technology development but also systematic testing and a usability study. We have already created a feature selection repository, funded by NSF in past years, and plan to expand, maintain, and improve it with the undergraduate researchers' help and input. This proposal extends the scope of the above study with additional research and development tasks that are appropriate for undergraduate students and that may further advance the objectives of our current NSF project. The specific research-related tasks of the students are elaborated below.
Effective start/end date: 9/1/12 → 8/31/16
- National Science Foundation (NSF): $426,387.00