[tor-scaling] Analyzing the Predictive Capability of Tor Metrics

teor teor at riseup.net
Wed Jul 3 01:43:23 UTC 2019


Hi Mike,

I'm going to make a few suggestions for alternative analysis steps and
analysis methods. Please feel free to just use the ones that are helpful.

> On 3 Jul 2019, at 08:35, Mike Perry <mikeperry at torproject.org> wrote:
> 
> At Mozilla All Hands, we hoped to find a correlation between the amount
> of load on the Tor network and its historical performance.
> 
> Unfortunately, while there did appear to be periods of time where this
> correlation held, we discovered a major historical discontinuity in this
> correlation. We have some guesses that we need to investigate:
> https://lists.torproject.org/pipermail/tor-scaling/2019-July/000053.html
> 
> For purposes of the discussion below, let's set aside one-off causes of
> the discontinuity (things like the siv torperf's ISP changing, torperf
> upgrades, and consensus parameters), and instead focus on candidate
> independent variables that influence the time periods outside of the
> discontinuity (and may also influence the discontinuity itself).

Should we do a separate analysis before and after the discontinuity?
It looks like latency varies a lot before 2015, and throughput varies
a lot after late 2013.

As an aside, maybe we are looking for a change that started to be
deployed (or started to happen) in late 2013, and affected almost
all the network by 2015. Maybe a change in Tor 0.2.4 and later?
https://metrics.torproject.org/versions.html?start=2010-01-01&end=2019-07-03
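
A minimal sketch of that before/after split (the data here is synthetic,
standing in for the real torperf series, and the 2015-01-01 breakpoint
is just a guess):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.date_range("2012-01-01", "2018-12-31", freq="D")

# Synthetic daily latency with more variance before 2015, as a
# stand-in for the real torperf measurements.
scale = np.where(dates < pd.Timestamp("2015-01-01"), 1.5, 0.5)
latency = pd.Series(2.0 + rng.normal(size=len(dates)) * scale,
                    index=dates)

before = latency[:"2014-12-31"]
after = latency["2015-01-01":]
print("std before:", before.std(), "std after:", after.std())
```

If the two periods have clearly different variance (as the synthetic
series does here), that would support analysing them separately.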

> So, how can we tell what factors actually really contribute to the
> performance of the Tor network? Let's use statistics.
> 
> Let's start by calling Tor performance our dependent variable.
> 
> Based on the brainstorming at Mozilla, and in the meeting on Friday, we
> have a few candidate independent variables that influence performance:
>  1. Total Utilization
>  2. Bottleneck Utilization (Exit or Guard, whichever is scarce)
>  3. Total Capacity
>  4. Exit Capacity
>  5. Load Balancing

Why use "Bottleneck Utilization", but "Exit Capacity"?
Should we be using "Bottleneck Capacity" instead?

> Now, note that our performance metrics (the dependent variable) are all
> rank-comparable. We might need a human in the loop to account for the
> desired use cases/edge cases (ie lower latency is often more important
> than insanely high throughput, etc), but we can monotonically rank our
> historical performance results from better to worse, at whatever
> timescales we choose. In particular, we can look at a set of CDFs or
> boxplots of latency and throughput results, and we can say which pairs
> of latency and throughput are better for our users than others, using
> the "good" and "bad" CDF heuristics from our metrics page:
> https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#LatencyMetrics
> https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#ThroughputMetrics

Should we check the correlation between latency and throughput first?
We might actually have two dependent variables (which are dependent
on different things), rather than one. Or there could be some factors
which affect both throughput and latency, and others which only affect
one of those variables.
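
A quick version of that check could look like this (the arrays are
synthetic stand-ins for paired daily medians of latency and throughput;
real data would come from the torperf archives):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# Synthetic stand-ins: daily median latency (seconds) and
# throughput (KB/s), constructed with an inverse relationship.
latency = rng.uniform(0.5, 5.0, size=365)
throughput = 2000.0 / latency + rng.normal(0.0, 50.0, size=365)

rho, p_value = spearmanr(latency, throughput)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.2g})")
```

A strong correlation (positive or negative) would suggest we have one
underlying dependent variable; a weak one would suggest we should treat
latency and throughput separately.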

> Moreover, our independent variables can *also* be ranked in monotonic
> order. It is possible to rank plots of Utilization, Capacity, Relay
> Spare Capacity, and Relay Stream Capacity from "better" to "worse":
> https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#CapacityMetrics
> https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#BalancingMetrics
> 
> Once we have a monotonic rank-ordering on our dependent variable data
> points, and monotonic rank-orderings on our candidate independent
> variables' data points, we can use a statistical correlation coefficient
> to determine which network-level independent variables correlate best to
> the rank-ordered performance data. There are two statistical methods for
> determining correlation in monotonically rank-ordered data:
> https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
> https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
> 
> Note that both methods only require monotonic ordering. It is possible
> for us to declare "ties" or simply "shrug" when trying to decide if one
> latency/throughput combination is better than another.
> 
> Of the two, Kendall's Tau is less sensitive to distances in relative
> ranking, and instead more directly measures the property that "if the
> independent variable went up/down, did the dependent variable go up/down
> at the same time?"
> 
> So, after we have done this ranking (probably manually), we compute
> Kendall's Tau for the correlation of our five independent variables, and
> see which of "Total Utilization", "Bottleneck Utilization", "Total
> Capacity", "Exit Capacity", or "Load Balancing" best correlates to
> overall Tor performance in general, and for specific periods of time
> (such as across the discontinuity and during botnet and DoS attacks).
> 
> We can also investigate specific time periods where this correlation
> doesn't hold, and see what other variables are involved during those
> periods.
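
The correlation step you describe could be sketched like this (the
ranks are hypothetical, just for illustration; scipy's default tau-b
variant also handles the tied ranks we'd get from "shrugging"):

```python
from scipy.stats import kendalltau

# Hypothetical monotonic ranks for 8 time periods, best (1) to
# worst (8). Ties are allowed: two periods we "shrug" about share
# rank 3, and tau-b accounts for them.
performance_rank = [1, 2, 3, 3, 5, 6, 7, 8]
exit_capacity_rank = [2, 1, 4, 3, 5, 7, 6, 8]

tau, p_value = kendalltau(performance_rank, exit_capacity_rank)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3g})")
```

Running this for each candidate independent variable against the same
performance ranking would give us the comparison you describe.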

This analysis seems quite manual, and therefore we may need to make
decisions based on our expected model of the network.

Can we make that model explicit, before we analyse the data?

It looks like you have a causal model in mind, because you're talking
about dependent and independent variables. But I'm not sure exactly
what that model is.

And I'm not sure whether the independent variables you quoted are
necessarily uncorrelated with each other, or whether they actually
cause the dependent variables. (I'm also not sure whether
independence/dependence is required for your analysis to be useful.)

For example, I think Throughput (per client) and Total Utilisation
(across the network) will be strongly correlated. And due to the way we
measure capacity, we may also "discover" extra Total Capacity as
Throughput increases.

Given these potential correlations, here is an additional way we could
analyse the data:

We do an Exploratory Factor Analysis on the 7 variables we identified,
in order to reduce them to a smaller set of underlying factors:
https://en.m.wikipedia.org/wiki/Exploratory_factor_analysis

A factor analysis could be useful, because we don't need to make as
many assumptions about the relationships between the variables in our
analysis.

The analysis will identify the most significant factor, and how it is
correlated with each of our named variables. It should also provide a
list of other factors which arise out of correlations in the data.

Ideally, we would find groups of strongly correlated independent
variables, which determine throughput, latency, or a combination of
both.

If we see a strong correlation between one or more independent
variables, and a dependent variable, then we can try varying them in
shadow, and see if we observe the same effect.
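
As a rough sketch of what that factor analysis could look like (the
data is entirely synthetic, standing in for the 7 variables, and the
two-latent-factor structure is an assumption made for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Assume two hypothetical latent factors: "load" drives the
# utilisation, capacity, and throughput variables; "congestion"
# drives latency and balancing.
load = rng.normal(size=n)
congestion = rng.normal(size=n)
noise = lambda: rng.normal(scale=0.3, size=n)

data = np.column_stack([
    load + noise(),               # total utilisation
    load + noise(),               # bottleneck utilisation
    0.8 * load + noise(),         # total capacity ("discovered" with load)
    0.8 * load + noise(),         # exit capacity
    congestion + noise(),         # load balancing
    congestion + noise(),         # latency
    load - 0.5 * congestion + noise(),  # throughput
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(StandardScaler().fit_transform(data))

# One row of loadings per factor, one column per variable: large
# absolute loadings show which variables each factor explains.
loadings = fa.components_
print("loadings shape:", loadings.shape)
```

On the real metrics data, the loadings would tell us which of the named
variables move together, without assuming a causal direction up front.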

> Once this work is done, and we have a good idea what factors are most
> strongly correlated with historical Tor performance, we can start
> running shadow models where we vary these factors, and determine the
> smallest shadow model that is still able to show this relationship and
> its effects on performance metrics. This model will be our "smallest"
> baseline simulator, which we can also use as we conduct further
> performance tuning experiments.

It's been a few years since I've done a factor analysis, so let me know
if you think it would be helpful in this situation.

T
