[metrics-bugs] #29346 [Metrics/Website]: Document why our CSV files are in tidy/long format and how to process them

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Feb 5 20:38:49 UTC 2019


#29346: Document why our CSV files are in tidy/long format and how to process them
---------------------------------+--------------------------
     Reporter:  karsten          |      Owner:  metrics-team
         Type:  enhancement      |     Status:  new
     Priority:  Medium           |  Milestone:
    Component:  Metrics/Website  |    Version:
     Severity:  Normal           |   Keywords:
Actual Points:                   |  Parent ID:
       Points:                   |   Reviewer:
      Sponsor:                   |
---------------------------------+--------------------------
 This ticket is based on a discussion in Brussels.

 The issue we talked about is that it can sometimes be difficult to import
 our per-graph CSV files into applications like LibreOffice Calc or
 services like CKAN and make charts out of them.

 The reason is that we chose to use tidy/"long" data formats for our CSV
 files. For example, the following lines are contained in the
 relayflags.csv file:

 {{{
 date,flag,relays
 2007-10-27,Exit,602
 2007-10-27,Fast,1126
 2007-10-27,Guard,244
 2007-10-27,Running,1254
 2007-10-27,Stable,586
 2007-10-28,Exit,592
 2007-10-28,Fast,1115
 2007-10-28,Guard,293
 2007-10-28,Running,1244
 2007-10-28,Stable,578
 [...]
 }}}

 However, many charting applications expect the data in the messy/"wide"
 format, with one column per flag:

 {{{
 date,Exit,Fast,Guard,Running,Stable
 2007-10-27,602,1126,244,1254,586
 2007-10-28,592,1115,293,1244,578
 [...]
 }}}

 In Brussels we briefly discussed changing our formats accordingly, to
 please LibreOffice Calc et al. However, after giving this some more
 thought, I'm opposed to this idea.

 There are reasons why we picked the tidy format in the first place: it's
 more flexible, because a new flag or category simply becomes additional
 rows and we never have to add or remove columns. It's also somewhat
 easier to handle with statistics tools/languages like R and the very
 powerful tidyverse libraries. See also Hadley Wickham's Tidy Data paper,
 which is a really good read on this topic:
 https://www.jstatsoft.org/article/view/v059i10
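
 To illustrate the flexibility argument, here is a minimal Python sketch
 (standard library only) that averages relay counts per flag from the
 tidy rows quoted above. The function name mean_relays_per_flag is made up
 for this example and is not part of any Tor Metrics code. The point is
 that no flag name is hard-coded, so rows for a new flag would need no
 code change.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the relayflags.csv excerpt quoted in this ticket.
TIDY_CSV = """date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def mean_relays_per_flag(tidy_text):
    """Return {flag: mean relay count} for tidy date,flag,relays rows.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(lambda: [0, 0])  # flag -> [sum, row count]
    for row in csv.DictReader(io.StringIO(tidy_text)):
        entry = totals[row["flag"]]
        entry[0] += int(row["relays"])
        entry[1] += 1
    return {flag: total / count for flag, (total, count) in totals.items()}


means = mean_relays_per_flag(TIDY_CSV)
for flag in sorted(means):
    print(flag, means[flag])
```

 With the wide format, the same code would have to know the column names
 in advance or treat every non-date column as a flag.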

 What can we do? I don't want to make the data harder to process for
 anyone, and sometimes LibreOffice Calc or CKAN can be great tools to get a
 first impression of a data set. We also cannot expect everyone to use R
 or SPSS or MATLAB. But maybe we can solve this with better documentation
 rather than by changing the way we're doing things.

 The magic word here seems to be: '''pivot table'''. This random blog post
 that I just found seems to be a good start for people wanting to wrangle
 our tidy data into whatever they need for making charts:
 https://blog.datawrapper.de/pivottables/
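
 For people who would rather script the conversion than click through a
 pivot table dialog, the same operation can be sketched in a few lines of
 Python using only the standard library. This is an illustration under
 assumptions, not an official tool: the function name pivot_long_to_wide
 and its parameters are invented here, and the input is just the
 relayflags.csv excerpt from above.

```python
import csv
import io

# Rows copied from the relayflags.csv excerpt quoted in this ticket.
LONG_CSV = """date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def pivot_long_to_wide(long_text, index="date", column="flag", value="relays"):
    """Pivot tidy CSV text into wide CSV text.

    Hypothetical helper: one output row per `index` value and one
    output column per distinct `column` value.
    """
    rows = list(csv.DictReader(io.StringIO(long_text)))
    # Distinct flags become the column headers, in sorted order.
    headers = sorted({row[column] for row in rows})
    wide = {}
    for row in rows:
        wide.setdefault(row[index], {})[row[column]] = row[value]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([index] + headers)
    for key in sorted(wide):
        # A missing date/flag combination is written as an empty cell.
        writer.writerow([key] + [wide[key].get(h, "") for h in headers])
    return out.getvalue()


print(pivot_long_to_wide(LONG_CSV), end="")
```

 For the excerpt above this prints the two-row wide table shown earlier
 in this ticket, which can then be opened directly in a charting
 application.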

 And this random CKAN plugin that I did ''not'' try out could be a way to
 teach CKAN how to use our tidy data formats for its preview
 visualizations: https://github.com/routetopa/ckanext-pivottable

 So, how about we document the reasons for choosing tidy data formats on
 the Statistics page and link to a few tutorials for processing our data
 with common charting tools? Ideally, we would add links rather than write
 a lot of text on our own, though.

 Does this sound plausible?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29346>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

