[metrics-bugs] #25383 [Metrics/Website]: Deprecate stats.html and stats/*.csv files

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue May 8 09:09:33 UTC 2018


#25383: Deprecate stats.html and stats/*.csv files
-----------------------------+------------------------------
 Reporter:  karsten          |          Owner:  metrics-team
     Type:  enhancement      |         Status:  new
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------

Comment (by karsten):

 Replying to [comment:12 irl]:
 > Are we strongly against the idea of providing two CSV files?

 I thought a lot about this yesterday, and I think the answer
 is: yes.

 Providing two types of CSV files pretty much doubles our effort for adding
 new aggregations or graphs as well as changing or removing parts. I'd
 prefer the process for adding or improving graphs to get easier, not
 harder.

 Let's try to provide just one type of CSV file, assuming that we don't
 break existing, valid use cases.

 But let's find a way to stop providing our pre-aggregated statistics
 files. They are not the best interface that we can provide. And they are
 an interface that can become quite painful to maintain in the future.

 > I'd like to see the current CSV that only contains the data used to
 produce the plot, and then additionally the full CSV pre-filtering that
 would contain all the data.
 >
 > This would work for the use case where you want to do your own
 processing on the data and would also work for the use case where someone
 wanted to produce plots using the same data that we have already filtered
 and processed.
 >
 > For the full CSV file, a header would probably be useful. It may also be
 useful to have an HTML page that contains a list of all the available CSV
 files but the specifications for those files could be documented in the
 headers of the CSVs. We wouldn't need to list the individual pre-filtered
 CSV files on that page.

 Understood, I think.

 Here's another suggestion:

  4. We provide 1 CSV file per graph that is parameterized by default and
 that can also be requested without any parameters. The link on the graph
 page would contain the same parameters as the graph, so that the CSV file
 content would be pretty close to what's shown in the graph. Except that
 the file might contain a few more columns. But the header would explain
 those columns. And the header would also say that it's possible to drop
 parameters to get more data for different parameter combinations of this
 graph.

 Let's make this more concrete by adding sample data:

 The CSV link on the current
 [https://metrics.torproject.org/userstats-relay-country.html Relay users]
 graph page would read (line break added for visibility):

 {{{
 https://metrics.torproject.org/userstats-relay-country.csv?
     start=2018-02-07&end=2018-05-08&country=all&events=off
 }}}
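
 As a rough sketch of how such a link could be built, and of what "dropping
 the parameters" means in practice, here is a small Python example. The
 helper name `csv_url` is made up for illustration, and the parameter names
 are simply the ones from the proposed link above; nothing here is
 implemented on the website yet.

```python
from urllib.parse import urlencode

# Proposed (not yet working) per-graph CSV location, taken from the link above.
BASE_URL = "https://metrics.torproject.org/userstats-relay-country.csv"


def csv_url(**params):
    """Hypothetical helper: build the CSV URL for the given graph parameters.

    Called without arguments it returns the bare URL, which under this
    proposal would return data for all parameter combinations.
    """
    # Skip parameters explicitly set to None so callers can omit them.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}?{query}" if query else BASE_URL


# Same parameters as the graph page, so the CSV matches the graph.
graph_url = csv_url(start="2018-02-07", end="2018-05-08",
                    country="all", events="off")

# No parameters: the full, unfiltered file.
full_url = csv_url()

print(graph_url)
print(full_url)
```

 The point of the design is that both links refer to the same file type;
 the parameters only narrow down what is returned.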

 The first and last lines of that file would be:

 {{{
 #
 # The Tor Project
 #
 # URL: https://metrics.torproject.org/userstats-relay-country.csv?start=2018-02-07&end=2018-05-08&country=all&events=off
 #
 # Insert some specification...
 #
 date,country,users,downturns,upturns,lower,upper
 2018-02-07,,4071868,,,,
 2018-02-08,,3815277,,,,
 2018-02-09,,4000274,,,,
 [...]
 2018-05-03,,2296101,,,,
 2018-05-04,,2341577,,,,
 2018-05-05,,2229328,,,,
 }}}

 Now, if someone's interested in data for all dates, a break-down by all
 possible countries, and possible censorship events, they'd simply take out
 all parameters and fetch the following file (link does not work yet):

 {{{
 https://metrics.torproject.org/userstats-relay-country.csv
 }}}

 The first and last lines would be:

 {{{
 #
 # The Tor Project
 #
 # URL: https://metrics.torproject.org/userstats-relay-country.csv
 #
 # Insert some specification...
 #
 date,country,users,downturns,upturns,lower,upper
 2011-03-06,a1,1443,,,,
 2011-03-06,a2,424,,,,
 2011-03-06,ae,8395,,,,
 [...]
 2018-05-06,zw,245,FALSE,FALSE,122,389
 2018-05-06,,2220344,,,,
 2018-05-06,??,25797,,,,
 }}}
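
 To illustrate that a `#`-prefixed header would not get in the way of
 processing, here is a minimal Python sketch that reads a few of the sample
 lines above. The function name `read_rows` is made up, and treating an
 empty `country` field as the all-countries total is an assumption based on
 the sample data, not something the (still missing) specification says.

```python
import csv
import io

# A few lines copied from the sample output above.
SAMPLE = """\
#
# The Tor Project
#
# URL: https://metrics.torproject.org/userstats-relay-country.csv
#
# Insert some specification...
#
date,country,users,downturns,upturns,lower,upper
2011-03-06,a1,1443,,,,
2018-05-06,zw,245,FALSE,FALSE,122,389
2018-05-06,,2220344,,,,
"""


def read_rows(text):
    """Hypothetical helper: parse the CSV, ignoring '#' header lines."""
    data_lines = (line for line in io.StringIO(text)
                  if not line.startswith("#"))
    # The first non-comment line is the column header.
    return list(csv.DictReader(data_lines))


rows = read_rows(SAMPLE)

# Assumption: an empty country field marks the total over all countries.
totals = [row for row in rows if row["country"] == ""]

for row in totals:
    print(row["date"], row["users"])
```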

 For comparison, the current CSV file, ''that we wouldn't provide
 anymore'', starts and ends with the following lines:

 {{{
 date,node,country,transport,version,lower,upper,clients,frac
 2011-03-06,relay,a1,,,,,1443,11
 2011-03-06,relay,a2,,,,,424,11
 2011-03-06,relay,ad,,,,,70,11
 [...]
 2018-05-06,bridge,,scramblesuit,,,,16,63
 2018-05-06,bridge,,snowflake,,,,3,63
 2018-05-06,bridge,??,,,,,1135,63
 }}}

 Note that the bridge user data would still be available on the various
 bridge users graphs.

 And we could discuss whether it makes sense to include the `frac` column
 in the relay users CSV file or not. If we include it, it would be there in
 the parameterized CSV file as well as the non-parameterized CSV file. I
 guess this is a trade-off between usability ("less is more") and
 usefulness ("more details can help").

 Thoughts?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25383#comment:13>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

