How does the data sampling strategy impact the discovery of information diffusion in social media?

Munmun De Choudhury, Yu Ru Lin, Hari Sundaram, Kasim Candan, Lexing Xie, Aisling Kelliher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

135 Citations (Scopus)

Abstract

Platforms such as Twitter have provided researchers with ample opportunities to analytically study social phenomena. There are, however, significant computational challenges due to the enormous rate of production of new information: researchers are therefore often forced to analyze a judiciously selected "sample" of the data. Like other social media phenomena, information diffusion is a social process: it is affected by user context and topic, in addition to the graph topology. This paper studies the impact of different attribute- and topology-based sampling strategies on the discovery of an important social media phenomenon, information diffusion. We examine several widely adopted sampling methods that select nodes based on attribute (random, location, and activity) and topology (forest fire), and we study the impact of attribute-based seed selection on topology-based sampling. We then develop a series of metrics for evaluating the quality of the sample, based on user activity (e.g. volume, number of seeds), topological characteristics (e.g. reach, spread), and temporal characteristics (e.g. rate). We additionally correlate the diffusion volume metric with two external variables: search and news trends. Our experiments reveal that for small sample sizes (30%), a sample that incorporates both topology and user context (e.g. location, activity) can improve on naïve methods by a significant margin of ∼15-20%.
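
The abstract does not describe the authors' implementation, so the following is only a minimal sketch of how attribute-based seed selection might be combined with forest fire sampling. It assumes the commonly described forest fire procedure (each burning node ignites a geometrically distributed number of unvisited neighbours) and an adjacency-list graph. The names `top_activity_seeds`, `forest_fire_sample`, and `p_forward` are hypothetical and do not come from the paper.

```python
import random
from collections import deque


def top_activity_seeds(activity, k):
    """Rank users by an activity count (hypothetical attribute) and keep the top k."""
    return sorted(activity, key=activity.get, reverse=True)[:k]


def forest_fire_sample(graph, target_size, seeds, p_forward=0.7, rng=None):
    """Sample up to target_size nodes by repeatedly 'burning' outward from seeds.

    graph: dict mapping node -> list of neighbour nodes (assumed format).
    seeds: ordered list of preferred seed nodes, e.g. chosen by an attribute.
    p_forward: burning probability; must be in [0, 1).
    """
    if not 0 <= p_forward < 1:
        raise ValueError("p_forward must be in [0, 1)")
    rng = rng or random.Random()
    target_size = min(target_size, len(graph))
    sampled = set()
    seed_queue = deque(seeds)

    while len(sampled) < target_size:
        # Prefer the next attribute-selected seed; otherwise restart at random.
        seed = None
        while seed_queue:
            candidate = seed_queue.popleft()
            if candidate in graph and candidate not in sampled:
                seed = candidate
                break
        if seed is None:
            seed = rng.choice([n for n in graph if n not in sampled])

        sampled.add(seed)
        frontier = deque([seed])
        while frontier and len(sampled) < target_size:
            node = frontier.popleft()
            unburned = [n for n in graph.get(node, []) if n not in sampled]
            # Draw a geometric number of links to burn (mean p / (1 - p)).
            burn_count = 0
            while rng.random() < p_forward:
                burn_count += 1
            rng.shuffle(unburned)
            for neighbour in unburned[:burn_count]:
                if len(sampled) >= target_size:
                    break
                sampled.add(neighbour)
                frontier.append(neighbour)
    return sampled
```

Passing `top_activity_seeds(activity, k)` as `seeds` gives an activity-seeded topological sample, while passing an empty list falls back to purely random seeds. Comparing the two on the same graph is one way to explore the kind of contrast the abstract describes.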

Original language: English (US)
Title of host publication: ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media
Pages: 34-41
Number of pages: 8
State: Published - 2010
Event: 4th International AAAI Conference on Weblogs and Social Media, ICWSM 2010 - Washington, DC, United States
Duration: May 23 2010 to May 26 2010

Other

Other: 4th International AAAI Conference on Weblogs and Social Media, ICWSM 2010
Country: United States
City: Washington, DC
Period: 5/23/10 to 5/26/10

Fingerprint

Topology
Sampling
Seed
Fires
Experiments

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

De Choudhury, M., Lin, Y. R., Sundaram, H., Candan, K., Xie, L., & Kelliher, A. (2010). How does the data sampling strategy impact the discovery of information diffusion in social media? In ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (pp. 34-41)

How does the data sampling strategy impact the discovery of information diffusion in social media? / De Choudhury, Munmun; Lin, Yu Ru; Sundaram, Hari; Candan, Kasim; Xie, Lexing; Kelliher, Aisling.

ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media. 2010. p. 34-41.

De Choudhury, M, Lin, YR, Sundaram, H, Candan, K, Xie, L & Kelliher, A 2010, How does the data sampling strategy impact the discovery of information diffusion in social media? in ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media. pp. 34-41, 4th International AAAI Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, United States, 5/23/10.
De Choudhury M, Lin YR, Sundaram H, Candan K, Xie L, Kelliher A. How does the data sampling strategy impact the discovery of information diffusion in social media? In ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media. 2010. p. 34-41
De Choudhury, Munmun ; Lin, Yu Ru ; Sundaram, Hari ; Candan, Kasim ; Xie, Lexing ; Kelliher, Aisling. / How does the data sampling strategy impact the discovery of information diffusion in social media? ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media. 2010. pp. 34-41
@inproceedings{93d108b226dd410e9156c27674401f74,
title = "How does the data sampling strategy impact the discovery of information diffusion in social media?",
abstract = "Platforms such as Twitter have provided researchers with ample opportunities to analytically study social phenomena. There are, however, significant computational challenges due to the enormous rate of production of new information: researchers are therefore often forced to analyze a judiciously selected {"}sample{"} of the data. Like other social media phenomena, information diffusion is a social process: it is affected by user context and topic, in addition to the graph topology. This paper studies the impact of different attribute- and topology-based sampling strategies on the discovery of an important social media phenomenon, information diffusion. We examine several widely adopted sampling methods that select nodes based on attribute (random, location, and activity) and topology (forest fire), and we study the impact of attribute-based seed selection on topology-based sampling. We then develop a series of metrics for evaluating the quality of the sample, based on user activity (e.g. volume, number of seeds), topological characteristics (e.g. reach, spread), and temporal characteristics (e.g. rate). We additionally correlate the diffusion volume metric with two external variables: search and news trends. Our experiments reveal that for small sample sizes (30{\%}), a sample that incorporates both topology and user context (e.g. location, activity) can improve on na{\"i}ve methods by a significant margin of ∼15-20{\%}.",
author = "{De Choudhury}, Munmun and Lin, {Yu Ru} and Hari Sundaram and Kasim Candan and Lexing Xie and Aisling Kelliher",
year = "2010",
language = "English (US)",
isbn = "9781577354451",
pages = "34--41",
booktitle = "ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media",

}

TY - GEN

T1 - How does the data sampling strategy impact the discovery of information diffusion in social media?

AU - De Choudhury, Munmun

AU - Lin, Yu Ru

AU - Sundaram, Hari

AU - Candan, Kasim

AU - Xie, Lexing

AU - Kelliher, Aisling

PY - 2010

Y1 - 2010

AB - Platforms such as Twitter have provided researchers with ample opportunities to analytically study social phenomena. There are, however, significant computational challenges due to the enormous rate of production of new information: researchers are therefore often forced to analyze a judiciously selected "sample" of the data. Like other social media phenomena, information diffusion is a social process: it is affected by user context and topic, in addition to the graph topology. This paper studies the impact of different attribute- and topology-based sampling strategies on the discovery of an important social media phenomenon, information diffusion. We examine several widely adopted sampling methods that select nodes based on attribute (random, location, and activity) and topology (forest fire), and we study the impact of attribute-based seed selection on topology-based sampling. We then develop a series of metrics for evaluating the quality of the sample, based on user activity (e.g. volume, number of seeds), topological characteristics (e.g. reach, spread), and temporal characteristics (e.g. rate). We additionally correlate the diffusion volume metric with two external variables: search and news trends. Our experiments reveal that for small sample sizes (30%), a sample that incorporates both topology and user context (e.g. location, activity) can improve on naïve methods by a significant margin of ∼15-20%.

UR - http://www.scopus.com/inward/record.url?scp=84890589151&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84890589151&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84890589151

SN - 9781577354451

SP - 34

EP - 41

BT - ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media

ER -