Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose

Fred Morstatter, Jürgen Pfeffer, Huan Liu, Kathleen M. Carley

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

  • 181 Citations

Abstract

Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.

Original language: English (US)
Title of host publication: Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013
Publisher: AAAI press
Pages: 400-408
Number of pages: 9
State: Published - 2013
Event: 7th International AAAI Conference on Weblogs and Social Media, ICWSM 2013 - Cambridge, MA, United States

Other

Other: 7th International AAAI Conference on Weblogs and Social Media, ICWSM 2013
Country: United States
City: Cambridge, MA
Period: 7/8/13 to 7/11/13

Fingerprint

Application programming interfaces (API)
Industry

ASJC Scopus subject areas

  • Media Technology

Cite this

Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose. In Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013 (pp. 400-408). AAAI press.

Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose. / Morstatter, Fred; Pfeffer, Jürgen; Liu, Huan; Carley, Kathleen M.

Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013. AAAI press, 2013. p. 400-408.

Morstatter, F, Pfeffer, J, Liu, H & Carley, KM 2013, Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose. in Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013. AAAI press, pp. 400-408, 7th International AAAI Conference on Weblogs and Social Media, ICWSM 2013, Cambridge, MA, United States, 8-11 July.
Morstatter F, Pfeffer J, Liu H, Carley KM. Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose. In Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013. AAAI press. 2013. p. 400-408.

@inproceedings{a3d68974958f4ba3a4e9e6b589c02333,
title = "Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose",
abstract = "Twitter is a social media giant famous for the exchange of short, 140-character messages called {"}tweets{"}. In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a {"}Streaming API{"} which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.",
author = "Fred Morstatter and Jürgen Pfeffer and Huan Liu and Carley, {Kathleen M.}",
year = "2013",
pages = "400--408",
booktitle = "Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013",
publisher = "AAAI press",

}

TY - CHAP

T1 - Is the sample good enough? Comparing data from twitter's streaming API with Twitter's firehose

AU - Morstatter,Fred

AU - Pfeffer,Jürgen

AU - Liu,Huan

AU - Carley,Kathleen M.

PY - 2013

Y1 - 2013

AB - Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.

UR - http://www.scopus.com/inward/record.url?scp=84892704954&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84892704954&partnerID=8YFLogxK

M3 - Conference contribution

SP - 400

EP - 408

BT - Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013

PB - AAAI press

ER -