<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Hi Mike,<div><br></div><div>I'm going to make a few suggestions for alternative analysis steps and</div><div>analysis methods. Please feel free to just use the ones that are helpful.</div><div><br><div dir="ltr">On 3 Jul 2019, at 08:35, Mike Perry &lt;<a href="mailto:mikeperry@torproject.org">mikeperry@torproject.org</a>&gt; wrote:<br><br></div><blockquote type="cite"><div dir="ltr"><span>At Mozilla All Hands, we hoped to find a correlation between the amount</span><br><span>of load on the Tor network and its historical performance.</span><br><span></span><br><span>Unfortunately, while there did appear to be periods of time where this</span><br><span>correlation held, we discovered a major historical discontinuity in this</span><br><span>correlation. We have some guesses that we need to investigate:</span><br><span><a href="https://lists.torproject.org/pipermail/tor-scaling/2019-July/000053.html">https://lists.torproject.org/pipermail/tor-scaling/2019-July/000053.html</a></span><br><span></span><br><span>For purposes of the discussion below, let's set aside one-off causes of</span><br><span>the discontinuity (things like the siv torperf's ISP changing, torperf</span><br><span>upgrades, and consensus parameters), and instead focus on candidate</span><br><span>independent variables that influence the time periods outside of the</span><br><span>discontinuity (and may also influence the discontinuity itself).</span><br></div></blockquote><div><br></div><div>Should we do separate analyses for the periods before and after the</div><div>discontinuity? It looks like latency varies a lot before 2015, and</div><div>throughput varies a lot after late 2013.</div><div><br></div><div>As an aside, maybe we are looking for a change that started to be</div><div>deployed (or started to happen) in late 2013, and affected almost</div><div>all the network by 2015. 
Maybe a change in Tor 0.2.4 and later?</div><div><a href="https://metrics.torproject.org/versions.html?start=2010-01-01&end=2019-07-03">https://metrics.torproject.org/versions.html?start=2010-01-01&end=2019-07-03</a></div><br><blockquote type="cite"><div dir="ltr"><span>So, how can we tell what factors actually really contribute to the</span><br><span>performance of the Tor network? Let's use statistics.</span><br><span></span><br><span>Let's start of calling Tor performance our dependent variable.</span><br><span></span><br><span>Based on the brainstorming at Mozilla, and in the meeting on Friday, we</span><br><span>have a few candidate independent variables that influence performance:</span><br><span>  1. Total Utilization</span><br><span>  2. Bottleneck Utilization (Exit or Guard, whichever is scarce)</span><br><span>  3. Total Capacity</span><br><span>  4. Exit Capacity</span><br><span>  5. Load Balancing</span><br></div></blockquote><div><br></div><div>Why use "<span style="background-color: rgba(255, 255, 255, 0);">Bottleneck Utilization", but "Exit Capacity"?</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">Should we be using "Bottleneck Capacity" instead?</span></div><br><blockquote type="cite"><div dir="ltr"><span>Now, note that our performance metrics (the dependent variable) are all</span><br><span>rank-comparable. We might need a human in the loop to account for the</span><br><span>desired use cases/edge cases (ie lower latency is often more important</span><br><span>than insanely high throughput, etc), but we can monotonically rank our</span><br><span>historically performance results from better to worse, at whatever</span><br><span>timescales we choose. 
In particular, we can look at a set of CDFs or</span><br><span>boxplots of latency and throughput results, and we can say which pairs</span><br><span>of latency and throughput are better for our users than others, using</span><br><span>the "good" and "bad" CDF heuristics from our metrics page:</span><br><span><a href="https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#LatencyMetrics">https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#LatencyMetrics</a></span><br><span><a href="https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#ThroughputMetrics">https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#ThroughputMetrics</a></span><br></div></blockquote><div><br></div><div>Should we check the correlation between latency and throughput first?</div><div>We might actually have two dependent variables (which are dependent</div><div>on different things), rather than one. Or there could be some factors</div><div>which affect both throughput and latency, and others which only affect</div><div>one of those variables.</div><br><blockquote type="cite"><div dir="ltr"><span>Moreover, our independent variables can *also* be ranked in monotonic</span><br><span>order. 
It is possible to rank plots of Utilization, Capacity, Relay</span><br><span>Spare Capacity, and Relay Stream Capacity from "better" to "worse":</span><br><span><a href="https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#CapacityMetrics">https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#CapacityMetrics</a></span><br><span><a href="https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#BalancingMetrics">https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#BalancingMetrics</a></span><br><span></span><br><span>Once we have a monotonic rank-ordering on our dependent variable data</span><br><span>points, and monotonic rank-orderings on our candidate independent</span><br><span>variables' data points, we can use a statistical correlation coefficient</span><br><span>to determine which network-level independent variables correlate best to</span><br><span>the rank-ordered performance data. There are two statistical methods for</span><br><span>determining correlation in monotonically rank-ordered data:</span><br><span><a href="https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient">https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient</a></span><br><span><a href="https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient">https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient</a></span><br><span></span><br><span>Note that both methods only require monotonic ordering. 
It is possible</span><br><span>for us to declare "ties" or simply "shrug" when trying to decide if one</span><br><span>latency/throughput combination is better than another.</span><br><span></span><br><span>Of the two, Kendall's Tau is less sensitive to distances in relative</span><br><span>ranking, and instead more directly measures the property that "if the</span><br><span>independent variable went up/down, did the dependent variable go up/down</span><br><span>at the same time?"</span><br><span></span><br><span>So, after we have done this ranking (probably manually), we compute</span><br><span>Kendall's Tau for the correlation of our five independent variables, and</span><br><span>see which of "Total Utilization", "Bottleneck Utilization", "Total</span><br><span>Capacity", "Total Capacity", or "Load Balancing" </span>best correlates to</div></blockquote><blockquote type="cite"><div dir="ltr"><span>overall Tor performance in general, and for specific periods of time</span><br><span>(such as across the discontinuity and during botnet and DoS attacks).</span><br><span></span><br><span>We can also investigate specific time periods where this correlation</span><br><span>doesn't hold, and see what other variables are involved during those</span><br><span>periods.</span><br></div></blockquote><div><br></div><div>This analysis seems quite manual, and therefore we may need to make</div><div>decisions based on our expected model of the network.</div><div><br></div><div>Can we make that model explicit, before we analyse the data?</div><div><br></div><div>It looks like you have a causal model in mind, because you're talking about</div><div>dependent and independent variables. But I'm not sure exactly what</div><div>that model is.</div><div><br></div><div>And I'm not sure whether the independent variables you quoted are</div><div>necessarily uncorrelated with each other, or the causes of the dependent</div><div>variables. 
(I'm also not sure if independence/dependence are required</div><div>for your analysis to be useful.)</div><div><br></div><div>For example, I think Throughput (per client) and Total Utilisation</div><div>(across the network) will be strongly correlated. And due to the way we</div><div>measure capacity, we may also "discover" extra Total Capacity as</div><div>Throughput increases.</div><div><br></div><div>Given these potential correlations, here is an additional way we could</div><div>analyse the data:</div><div><br></div><div>We do an Exploratory Factor Analysis on the 7 variables we identified</div><div>(the five candidate independent variables, plus latency and throughput),</div><div>in order to reduce them to a smaller set of underlying factors:</div><div><a href="https://en.m.wikipedia.org/wiki/Exploratory_factor_analysis">https://en.m.wikipedia.org/wiki/Exploratory_factor_analysis</a></div><div><br></div><div><div><span style="background-color: rgba(255, 255, 255, 0);">A factor analysis could be useful, because we don't need to make as</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">many </span><span style="background-color: rgba(255, 255, 255, 0);">assumptions about the relationships between the variables in our</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">analysis.</span></div></div><div><br></div><div>The analysis will identify the most significant factor, and how it is</div><div>correlated with each of our named variables. 
It should also provide a</div><div>list of other factors which arise out of correlations in the data.</div><div><br></div><div>Ideally, we would find groups of strongly correlated independent</div><div>variables, which determine throughput, latency, or a combination of</div><div>both.</div><div><br></div><div>If we see a strong correlation between one or more independent</div><div>variables, and a dependent variable, then we can try varying them in</div><div>shadow, and see if we observe the same effect.</div><br><blockquote type="cite"><div dir="ltr"><span>Once this work is done, and we have a good idea what factors are most</span><br><span>strongly correlated with historical Tor performance, we can start</span><br><span>running shadow models where we vary these factors, and determine the</span><br><span>smallest shadow model that is still able to show this relationship and</span><br><span>its effects on performance metrics. This model will be our "smallest"</span><br><span>baseline simulator, which we can also use as we conduct further</span><br><span>performance tuning experiments.</span><br></div></blockquote><div><br></div>It's been a few years since I've done a factor analysis, so let me know</div><div>if you think it would be helpful in this situation.</div><div><br></div><div>T<br><div><br></div></div></body></html>