Correcting Bias in Crowdsourced Data to Map Bicycle Ridership of All Bicyclists

Avipsa Roy; Trisalyn A. Nelson; A. Stewart Fotheringham; Meghan Winters

doi:10.3390/urbansci3020062

School of Geographical Sciences and Urban Planning (SGSUP)

Research output: Contribution to journal › Article › peer-review

45 Scopus citations

Abstract

Traditional methods of counting bicyclists are resource-intensive and generate data with sparse spatial and temporal detail. Previous research suggests big data from crowdsourced fitness apps offer a new source of bicycling data with high spatial and temporal resolution. However, crowdsourced bicycling data are biased as they oversample recreational riders. Our goals are to quantify geographical variables, which can help in correcting bias in crowdsourced data, and to develop a generalized method to correct bias in big crowdsourced data on bicycle ridership in different settings in order to generate maps for cities representative of all bicyclists at a street-level spatial resolution. We used street-level ridership data for 2016 from a crowdsourced fitness app (Strava), geographical covariate data, and official counts from 44 locations across Maricopa County, Arizona, USA (training data); and 60 locations from the city of Tempe, within Maricopa (test data). First, we quantified the relationship between Strava and official ridership data volumes. Second, we used a multi-step approach with variable selection using LASSO followed by Poisson regression to integrate geographical covariates, Strava, and training data to correct bias. Finally, we predicted bias-corrected average annual daily bicyclist counts for Tempe and evaluated the model’s accuracy using the test data. We found a correlation between the annual ridership data from Strava and official counts (R² = 0.76) in Maricopa County for 2016. The significant variables for correcting bias were: the proportion of white population, median household income, traffic speed, distance to residential areas, and distance to green spaces. The model could correct bias in crowdsourced data from Strava in Tempe with 86% of road segments being predicted within a margin of ±100 average annual bicyclists. Our results indicate that it is possible to map ridership for cities at the street level by correcting bias in crowdsourced bicycle ridership data, with access to adequate data from official count programs and geographical covariates at a comparable spatial and temporal resolution.

Original language	English (US)
Article number	62
Journal	Urban Science
Volume	3
Issue number	2
DOIs	https://doi.org/10.3390/urbansci3020062
State	Published - Jun 2019

Keywords

LASSO
active transportation
bias correction
big data
crowdsourcing

ASJC Scopus subject areas

Environmental Science (miscellaneous)
Geography, Planning and Development
Pollution
Waste Management and Disposal
Urban Studies

Cite this

@article{d473a84aac36425083db29d335883d5c,

title = "Correcting Bias in Crowdsourced Data to Map Bicycle Ridership of All Bicyclists",

abstract = "Traditional methods of counting bicyclists are resource-intensive and generate data with sparse spatial and temporal detail. Previous research suggests big data from crowdsourced fitness apps offer a new source of bicycling data with high spatial and temporal resolution. However, crowdsourced bicycling data are biased as they oversample recreational riders. Our goals are to quantify geographical variables, which can help in correcting bias in crowdsourced data, and to develop a generalized method to correct bias in big crowdsourced data on bicycle ridership in different settings in order to generate maps for cities representative of all bicyclists at a street-level spatial resolution. We used street-level ridership data for 2016 from a crowdsourced fitness app (Strava), geographical covariate data, and official counts from 44 locations across Maricopa County, Arizona, USA (training data); and 60 locations from the city of Tempe, within Maricopa (test data). First, we quantified the relationship between Strava and official ridership data volumes. Second, we used a multi-step approach with variable selection using LASSO followed by Poisson regression to integrate geographical covariates, Strava, and training data to correct bias. Finally, we predicted bias-corrected average annual daily bicyclist counts for Tempe and evaluated the model{\textquoteright}s accuracy using the test data. We found a correlation between the annual ridership data from Strava and official counts (R$^2$ = 0.76) in Maricopa County for 2016. The significant variables for correcting bias were: the proportion of white population, median household income, traffic speed, distance to residential areas, and distance to green spaces. The model could correct bias in crowdsourced data from Strava in Tempe with 86\% of road segments being predicted within a margin of ±100 average annual bicyclists. Our results indicate that it is possible to map ridership for cities at the street level by correcting bias in crowdsourced bicycle ridership data, with access to adequate data from official count programs and geographical covariates at a comparable spatial and temporal resolution.",

keywords = "LASSO, active transportation, bias correction, big data, crowdsourcing",

author = "Avipsa Roy and Nelson, {Trisalyn A.} and Fotheringham, {A. Stewart} and Meghan Winters",

note = "Publisher Copyright: {\textcopyright} 2019 by the authors.",

year = "2019",

month = jun,

doi = "10.3390/urbansci3020062",

language = "English (US)",

volume = "3",

journal = "Urban Science",

issn = "2413-8851",

publisher = "MDPI AG",

number = "2",

}

TY - JOUR

T1 - Correcting Bias in Crowdsourced Data to Map Bicycle Ridership of All Bicyclists

AU - Roy, Avipsa

AU - Nelson, Trisalyn A.

AU - Fotheringham, A. Stewart

AU - Winters, Meghan

PY - 2019/6

Y1 - 2019/6

AB - Traditional methods of counting bicyclists are resource-intensive and generate data with sparse spatial and temporal detail. Previous research suggests big data from crowdsourced fitness apps offer a new source of bicycling data with high spatial and temporal resolution. However, crowdsourced bicycling data are biased as they oversample recreational riders. Our goals are to quantify geographical variables, which can help in correcting bias in crowdsourced data, and to develop a generalized method to correct bias in big crowdsourced data on bicycle ridership in different settings in order to generate maps for cities representative of all bicyclists at a street-level spatial resolution. We used street-level ridership data for 2016 from a crowdsourced fitness app (Strava), geographical covariate data, and official counts from 44 locations across Maricopa County, Arizona, USA (training data); and 60 locations from the city of Tempe, within Maricopa (test data). First, we quantified the relationship between Strava and official ridership data volumes. Second, we used a multi-step approach with variable selection using LASSO followed by Poisson regression to integrate geographical covariates, Strava, and training data to correct bias. Finally, we predicted bias-corrected average annual daily bicyclist counts for Tempe and evaluated the model’s accuracy using the test data. We found a correlation between the annual ridership data from Strava and official counts (R² = 0.76) in Maricopa County for 2016. The significant variables for correcting bias were: the proportion of white population, median household income, traffic speed, distance to residential areas, and distance to green spaces. The model could correct bias in crowdsourced data from Strava in Tempe with 86% of road segments being predicted within a margin of ±100 average annual bicyclists. Our results indicate that it is possible to map ridership for cities at the street level by correcting bias in crowdsourced bicycle ridership data, with access to adequate data from official count programs and geographical covariates at a comparable spatial and temporal resolution.

KW - LASSO

KW - active transportation

KW - bias correction

KW - big data

KW - crowdsourcing

UR - http://www.scopus.com/inward/record.url?scp=85071785827&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85071785827&partnerID=8YFLogxK

U2 - 10.3390/urbansci3020062

DO - 10.3390/urbansci3020062

M3 - Article

AN - SCOPUS:85071785827

SN - 2413-8851

VL - 3

JO - Urban Science

JF - Urban Science

IS - 2

M1 - 62

ER -
