[tor-scaling] Nov 20 meeting recap: metrics analysis needed

Rob Jansen rob.g.jansen at nrl.navy.mil
Wed Jan 15 23:29:29 UTC 2020

> On Jan 13, 2020, at 9:23 PM, Mike Perry <mikeperry at torproject.org> wrote:
>> - This graph shows just the data from a single OnionPerf source. If we
>> add multiple sources, it gets even more overloaded. But we can't really
>> mix numbers from different sources, as they have their very own
>> connection characteristics that would skew results.
>> Happy to make more graphs. It might help to see a sketch or longer
>> description of what you expect to see. Thanks!
> Hrm, I am not a data visualization expert, but what is most important
> for us to understand is the nature of the variance of performance,
> including the length of the long tail.
> From your above plots, it looks like the experiment primarily negatively
> impacted the long tail of perf, and maybe even 95-99% perf, but not
> average perf. But I agree, even this much is hard to tell due to the
> scale needed to display the full tail in CDF form. Perhaps this means a
> clip at like 5-10 seconds for all graphs, to keep the X axis the same
> length, and then some additional way to quantify the length and quantity
> of the tail beyond the clip.
> Basically, we want to be able to see if 0-99% CDF slope became wider or
> got additional lumps, and we want to see if the 1% tail got longer or
> shorter (and ideally also check if it has similar membership and data
> points over time in terms of participant relays and time values, for
> bug-hunting analysis).
> We should definitely play around with a few different graphing methods
> though, to compare various ways of capturing this info.

I suggest that you abandon the CDFs, and use boxplots instead!

The y-axis can show the download time, and the x-axis can have one box per time period (moving one spot to the right takes you to the next time period and a new box). Each box encodes the CDF from that time period, except in boxplot form: the box spans the 1st through 3rd quartiles, the whiskers can extend from the 0th to the 99th percentile, and you can mark the median and the mean. You could even show the outliers above the 99th percentile if you want.

The boxplots will allow you to get a sense of the range of the distribution, and also the skew.
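
As a rough sketch of the layout described above (this is not the code behind the attached figure), something like the following could work in python-matplotlib. The helper names `percentile`, `box_stats`, and `plot_boxes`, the output file name, and the synthetic lognormal "download times" are all made up for illustration; real input would be the per-period OnionPerf measurements. The statistics are computed by hand so that the whiskers sit at the 0th and 99th percentiles, and the resulting dicts are handed to matplotlib's `Axes.bxp`.

```python
import random
from statistics import fmean


def percentile(sorted_vals, pct):
    """Linearly interpolated percentile of an already-sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one sample")
    rank = (len(sorted_vals) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def box_stats(label, times):
    """Summarize one time period as a dict in the format Axes.bxp expects."""
    vals = sorted(times)
    p99 = percentile(vals, 99)
    return {
        "label": label,
        "whislo": percentile(vals, 0),    # lower whisker: minimum
        "q1": percentile(vals, 25),       # bottom of the box
        "med": percentile(vals, 50),      # median line
        "q3": percentile(vals, 75),       # top of the box
        "whishi": p99,                    # upper whisker: 99th percentile
        "mean": fmean(vals),              # drawn as a marker
        "fliers": [v for v in vals if v > p99],  # the >99% tail
    }


def plot_boxes(stats, path="boxplot-sketch.png"):
    """Draw one box per time period; requires matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bxp(stats, showmeans=True, showfliers=True)
    ax.set_xlabel("Time period")
    ax.set_ylabel("Download time (s)")
    fig.savefig(path)


# Synthetic data only: three periods with an increasingly heavy tail.
rng = random.Random(0)
stats = [
    box_stats(f"day {i + 1}",
              [rng.lognormvariate(0.0, 0.5 + 0.2 * i) for _ in range(1000)])
    for i in range(3)
]
```

Calling `plot_boxes(stats)` writes the figure to a PNG. Because the points above the 99th percentile are kept in `fliers`, the same dicts could also be used to inspect which measurements make up the tail in each period.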

I have not done these in R, but I've attached an example from python-matplotlib, which can also be found in Figure 5(c) in a recent paper:

(In my case I was varying the number of attack circuits in each box, but I hope you get the idea.)

Peace, love, and positivity,
-------------- next part --------------
A non-text attachment was scrubbed...
Name: boxplot-example-pointbreak.png
Type: image/png
Size: 59499 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/tor-scaling/attachments/20200115/aadcce62/attachment-0001.png>