A week ago I published the presentation I had given during the ENERGIC Workshop at the University of Zurich. In it, I talked about crowdsourcing and VGI, and about using crowdsourced data in applications.
I’d like to delve deeper into one of the examples I talked about in my workshop presentation: some weeks ago, Strava, the provider of a fitness tracking app for cyclists and joggers, published a so-called heatmap of all recorded GPS locations of its users. Similar data has been published by comparable providers such as Runkeeper.
For what purposes can we use such crowdsourced data? Which questions might it help answer? As an example, I have coded up a small web map that overlays the bicycle infrastructure of the city of Zurich (thank you, Open Data team) onto the Strava heatmap. You can explore this map below (full-screen view can be toggled using the button on the left, below the zoom controls):
- The Strava data consists mostly of cycling routes (77,688,848 globally), but not exclusively: there are also 19,660,163 recorded jogging routes globally.
- How sure can we be that users have switched their app to the correct tracking mode (cycling vs. jogging)?
- Are there users who use their fitness tracking app to record their car trips, their motorbike excursions or their Sunday walks with the dog? If so, does Strava take steps to remove such recordings from its user data, for example by filtering based on velocity/acceleration characteristics?
- How many users have contributed their data in the Zurich city area, in a given neighbourhood, or on a much-used or little-used part of the network?
- Are there much-used routes that – counter-intuitively – have been used and recorded by only a limited (but enthusiastic) bunch of cyclists?
- On the other hand: are there much-used routes that have better “democratic legitimacy”, i.e. that have been used by a large number of different people on different occasions?
- If we could detect and distinguish these two kinds of routes: what would this tell us?
- What kind of recording errors (e.g. insufficient GPS coverage, multiple GPS reflections off buildings or trees) might be present in the data? How would such errors influence our intended analyses?
- What does the temporal distribution (currency) of the data look like? Has the majority of the data been collected over the last three years, the last year, the last six months? And how might an uneven temporal distribution influence insights from our intended analyses? (Three years ago, an estimate suggested that 10% of all existing photographs had been taken during the preceding 12 months! The temporally skewed distribution of Flickr photographs is a well-known fact.)
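To make the filtering question above a bit more concrete: we don’t know whether or how Strava filters motorised recordings, but a simple velocity-based heuristic is easy to sketch. The function below flags a GPS track as likely motorised when a large share of its segments exceed a speed threshold; the threshold and fraction are hypothetical values chosen for illustration, not anything Strava has documented.

```python
import math
from datetime import datetime, timedelta

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def likely_motorised(track, speed_threshold_kmh=45.0, min_fraction=0.5):
    """Flag a track as likely motorised if more than `min_fraction` of its
    segments exceed `speed_threshold_kmh` (both thresholds are illustrative).
    `track` is a time-ordered list of (timestamp, lat, lon) tuples."""
    fast = total = 0
    for (t1, la1, lo1), (t2, la2, lo2) in zip(track, track[1:]):
        dt = (t2 - t1).total_seconds()
        if dt <= 0:
            continue  # skip duplicate or out-of-order fixes
        speed_kmh = haversine_m(la1, lo1, la2, lo2) / dt * 3.6
        total += 1
        if speed_kmh > speed_threshold_kmh:
            fast += 1
    return total > 0 and fast / total > min_fraction

# Two synthetic tracks near Zurich: one moving far too fast for a bicycle,
# one at walking pace.
t0 = datetime(2014, 1, 1)
fast_track = [(t0 + timedelta(seconds=10 * i), 47.37 + 0.01 * i, 8.54)
              for i in range(10)]
slow_track = [(t0 + timedelta(seconds=10 * i), 47.37 + 0.0001 * i, 8.54)
              for i in range(10)]
print(likely_motorised(fast_track))  # True
print(likely_motorised(slow_track))  # False
```

A real filter would of course have to be more careful (GPS noise alone can produce implausible instantaneous speeds, which is why the sketch looks at the fraction of fast segments rather than at single spikes), but it illustrates the kind of preprocessing whose presence or absence we would want to know about.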
Generally, it’s advisable to have maximum control over the data production and collection process. If you rely on data provided by a third party, make sure you understand the processing steps the data has undergone (such as sampling, filtering, removal of “outliers”, etc.). Only with this knowledge can the fitness for use of a particular dataset be adequately judged.
Suggested reading: Timo Grossenbacher discusses representativity of crowdsourced data extensively in his Master’s thesis as well as in this blog post.
A few days after my blog post, Patrick Traughber posted a comparison of Strava data with Human.co data on Twitter. An interesting discussion ensued.
* Strava mentions that you might need different, more detailed data for certain analyses; especially for such applications, it provides anonymised raw data under its Metro brand.