[metrics-bugs] #33258 [Metrics]: Add CSV file export of graphed data

Tor Bug Tracker & Wiki blackhole at torproject.org
Fri Mar 20 22:39:27 UTC 2020

#33258: Add CSV file export of graphed data
 Reporter:  karsten                      |          Owner:  metrics-team
     Type:  enhancement                  |         Status:  needs_review
 Priority:  Medium                       |      Milestone:
Component:  Metrics                      |        Version:
 Severity:  Normal                       |     Resolution:
 Keywords:  metrics-team-roadmap-2020Q1  |  Actual Points:
Parent ID:  #33327                       |         Points:  1
 Reviewer:                               |        Sponsor:  Sponsor59
Changes (by karsten):

 * status:  new => needs_review


 I started working on #33256 and #33258 in parallel and now have a patch
 for review that implements #33258. This patch contains a rewrite of tgen
 plots to use the pandas and seaborn libraries.

 On the plus side of using pandas, it's really easy to export graphed data
 to a .csv file. And it's in general good practice to separate all data
 tidying from visualization.
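
 A minimal sketch of that separation, assuming pandas is installed. The
 records and column names below are made up for illustration and are not
 OnionPerf's actual data model; the pandas calls themselves
 (DataFrame.from_records, sort_values, to_csv) are standard.

```python
import io

import pandas as pd

# Hypothetical measurements; the column names are illustrative only.
records = [
    {"server": "ab", "filesize_bytes": 51200, "time_to_last_byte": 0.81},
    {"server": "nl", "filesize_bytes": 1048576, "time_to_last_byte": 2.95},
    {"server": "ab", "filesize_bytes": 1048576, "time_to_last_byte": 3.10},
    {"server": "nl", "filesize_bytes": 51200, "time_to_last_byte": 0.77},
]


def tidy(records):
    """Data tidying step: one row per measurement, sorted for stable output."""
    df = pd.DataFrame.from_records(records)
    df["filesize_bytes"] = df["filesize_bytes"].astype(int)
    return df.sort_values(["server", "filesize_bytes"]).reset_index(drop=True)


def export_csv(df, target):
    """Export step: write exactly the data that would be graphed."""
    df.to_csv(target, index=False)


df = tidy(records)
buffer = io.StringIO()  # a file path would work the same way
export_csv(df, buffer)
csv_text = buffer.getvalue()
```

 The same DataFrame could then be handed to the plotting code, so that the
 .csv file and the graph are built from the same data.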

 On the minus side, I didn't want to update all the old PyLab code to use
 pandas. Instead I switched that code to using seaborn, which is much newer
 and much higher-level. The code is much shorter and easier to read. But it
 comes with a few changes to produced plots that we need to discuss.

 I attached old and new tgen plots as .pdf files. Changes are:

  - ECDFs "Time to download first byte" and "Time to download last of {
 51200, 1048576, 5242880 } bytes" remain mostly unchanged. One very minor
 change is that lines now extend to (-Inf, 0) and (Inf, 1). As before,
 these plots are set to focus on values up to the 99th percentile.

  - Time plots "Time to download { first, last } of { 51200, 1048576,
 5242880 } bytes over time" are roughly the same as the "mean time to
 download [...]" plots. Noticeable differences are that the x axis uses
 datetime values rather than "ticks" and that the plot has changed from
 line to scatter plot. The rationale behind switching from lines to dots is
 that measurements are mostly independent from each other. This fact is
 better expressed by using a single dot per measurement rather than shorter
 or longer lines depending on how different subsequent measurements were
 and how much time has passed between those measurements.

  - Box plots "Time to download last of { 51200, 1048576, 5242880 } bytes"
 replace the "median time to download [...]" plots by giving more detail
 than just the median. They do not, however, show maxima or even any
 outliers at all, because extreme outliers can make it difficult to read
 the median value.

  - Bar plots "Mean time to download last of { 51200, 1048576, 5242880 }
 bytes" replace the "mean time to download [...]" plots. It's questionable
 whether these plots are still required with the box plots being present.

  - There are no equivalents for "max time to download [...]" plots,
 because the maximum download time can also be obtained from time plots. If
 having plots with download time maxima is for some reason important, they
 could be re-added as bar plots.

  - Count plots "Number of downloads of { 51200, 1048576, 5242880 } bytes
 completed" replace their similarly named equivalents but are much more

  - There are no equivalents for "number of { 51200, 1048576, 5242880 }
 bytes completed, all clients over time". These time plots are basically
 the same as the time plots showing download time, except that those have
 useful y values which these don't have.

  - Count plots "Number of downloads failed" and time plot "Download
 runtime until error" replace the various error graphs which didn't seem to
 be as useful.
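
 To make two of the items above concrete, here is a rough sketch of an ECDF
 whose view is limited to the 99th percentile and of box plot statistics
 that leave out outliers. It uses only the Python standard library, not the
 code from the patch. The sample data, the nearest-rank percentile, and the
 1.5 x IQR whisker rule are assumptions for illustration; the plotting
 library may compute these slightly differently.

```python
import statistics


def ecdf(values):
    """Return (value, cumulative fraction) pairs over all measurements."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    xs = sorted(values)
    rank = (q * len(xs) + 99) // 100  # integer ceiling of q * n / 100
    return xs[rank - 1]


def box_stats(values):
    """Quartiles plus whiskers at the extreme points within 1.5 * IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inside = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Hypothetical download times in seconds with one extreme outlier.
download_times = [1.0 + 0.01 * i for i in range(199)] + [60.0]

# The ECDF is computed over all data; only the visible range is limited.
x_limit = percentile_nearest_rank(download_times, 99)
visible = [(x, y) for x, y in ecdf(download_times) if x <= x_limit]

stats = box_stats(download_times)
```

 In this sample the 60-second outlier neither stretches the ECDF's x axis
 nor shows up in the box plot statistics, which keeps the median readable.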

 Regarding tor plots, I'm a bit unclear why we would need them at all. I
 attached the old tor plots as a .pdf file for discussion here. I did not yet
 rewrite this code, because maybe we can kill it right away. Some notes:

  - The "60 second moving average [...]" graphs are currently broken. The x
 axis is supposed to be the time in seconds, but it starts at unix time 0
 or 1970-01-01. If you look veeeeery closely at the space to the right of
 the legend you'll find the data points. However, I don't know how this
 visualization can be useful for anything besides debugging a handful of

  - The "1 second throughput [...]" graphs would be more useful with a
 higher data resolution than 1 KiB/s, which is the reason for those huge
 steps. But even if the ECDFs were smoother, is this really something
 we care about?
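
 The effect of the 1 KiB/s resolution can be illustrated with a small
 sketch. The throughput values are invented, and rounding down to multiples
 of 1024 bytes is only an assumption about how such a resolution might come
 about.

```python
# Hypothetical per-second throughput samples in bytes per second.
raw = [10_000 + 37 * i for i in range(100)]

# Assumed quantization: round down to whole KiB/s (multiples of 1024).
quantized = [(value // 1024) * 1024 for value in raw]

# Each distinct value becomes one step in an ECDF.
raw_steps = len(set(raw))
quantized_steps = len(set(quantized))
```

 With these numbers, 100 distinct raw values collapse into 5 distinct
 quantized values, so the ECDF would consist of only 5 large steps.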

 I attached my Git-formatted patch
 (-Rewrite-tgen-plots-to-use-pandas-and-seaborn.patch) for review;
 looks like I don't have an OnionPerf repository yet. But maybe we can have
 a higher-level discussion of the items above first before diving deep into
 the code review.

 Just in case somebody wants to reproduce these plots, here are the
 commands I used:

 python setup.py build
 sudo python setup.py install
 onionperf visualize \
   -d 2019-01-31.onionperf.analysis.json.xz 2019-01-31-ab \
   -d 2019-01-30.onionperf.analysis.json.xz 2019-01-30-nl

Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33258#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
