[tor-scaling] Nov 20 meeting recap: metrics analysis needed

Mon Dec 2 19:37:00 UTC 2019

Here is a belated, quick summary of the meeting on November 20th, with
some followup highlights.

We discussed Rob's relay flooding experiment, as well as some
interesting work Dennis has been doing on evaluating the overall
network-to-rendering performance of Tor Browser.

Rob's experiment is detailed at
https://lists.torproject.org/pipermail/tor-scaling/2019-August/000069.html
and on the tor-relays list.

As a quick recap, the idea is to flood individual relays with traffic
using several circuits for 20 seconds each. This flood causes the relay
descriptor observed bandwidth to rise to the level of the total capacity
of the relay. Because relays remember and retain their peak observed
bandwidth over 5 days, the relay capacity resulting from a single 20s
spike persists for 5 days after the spike.
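
To illustrate why a short flood has a multi-day effect, here is a
minimal sketch of a sliding-window maximum over bandwidth samples. The
function name, the one-sample-per-day data, and the exact expiry rule
are assumptions made for illustration; this is not tor's actual
bandwidth accounting code.

```python
from collections import deque

WINDOW_SECONDS = 5 * 24 * 3600  # 5-day retention, as described above


def reported_peaks(samples, window=WINDOW_SECONDS):
    """Given (timestamp, bandwidth) samples sorted by time, return the
    peak bandwidth seen within the trailing window at each sample."""
    peaks = []
    candidates = deque()  # (timestamp, bandwidth), bandwidth decreasing
    for ts, bw in samples:
        # Older, smaller-or-equal samples can never be the peak again.
        while candidates and candidates[-1][1] <= bw:
            candidates.pop()
        candidates.append((ts, bw))
        # Expire samples that have aged out of the window.
        while candidates[0][0] <= ts - window:
            candidates.popleft()
        peaks.append((ts, candidates[0][1]))
    return peaks


DAY = 24 * 3600
# Hypothetical relay: normally observes 40 units, flooded to 100 on day 1.
samples = [(d * DAY, 100 if d == 1 else 40) for d in range(9)]
result = reported_peaks(samples)
```

In this toy data the single spike on day 1 keeps the reported peak at
100 until the spike ages out of the window, after which the value drops
back to 40.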

Rob flooded the relays one by one from 8/7 until 8/9.

The results were visible in terms of updated weights for all relays in
the network from about 8/9 until 8/13, when they began to fall again:
https://metrics.torproject.org/bandwidth.html?start=2019-08-05&end=2019-08-15

The boost "revealed" approximately 200Gb/s of "unreported" total
capacity in the network (going from 400Gb/s total to 600Gb/s total, a
50% increase).

Interestingly, performance variance increased in both circuit latency
and throughput during the 8/9 to 8/13 period:
https://metrics.torproject.org/onionperf-latencies.html?start=2019-08-01&end=2019-08-30&server=public
https://metrics.torproject.org/onionperf-throughput.html?start=2019-08-01&end=2019-8-30&server=public

I believe that indicates that we need to thoroughly investigate load
balancing characteristics of the network with this change.

In particular, I would like to look at 8 hour snapshots from 8/5 to
8/15, broken out by relay flags, of CDF-TTFB and CDF-DL from
https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics
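
As a rough sketch of what such a breakdown could look like, the snippet
below buckets measurements into 8 hour snapshots per relay flag and
builds an empirical CDF for each bucket. The record layout
(timestamp, flag, value), the function name, and the sample values are
hypothetical; the actual definitions of CDF-TTFB and CDF-DL are the
ones on the wiki page above.

```python
from collections import defaultdict

SNAPSHOT_SECONDS = 8 * 3600


def snapshot_cdfs(records, start_ts):
    """Group (timestamp, flag, value) records into 8 hour snapshots per
    flag and return {(snapshot_index, flag): [(value, fraction), ...]}."""
    groups = defaultdict(list)
    for ts, flag, value in records:
        index = int((ts - start_ts) // SNAPSHOT_SECONDS)
        groups[(index, flag)].append(value)
    cdfs = {}
    for key, values in groups.items():
        values.sort()
        n = len(values)
        # The i-th smallest sample gets cumulative fraction (i + 1) / n.
        cdfs[key] = [(v, (i + 1) / n) for i, v in enumerate(values)]
    return cdfs


# Made-up measurements, in seconds, for illustration only.
records = [
    (0, "Guard", 0.5),
    (3600, "Guard", 1.5),
    (3600, "Exit", 0.8),
    (9 * 3600, "Guard", 0.7),
]
cdfs = snapshot_cdfs(records, start_ts=0)
```

Each key in cdfs identifies one snapshot and one flag, so the same
snapshot can be compared across flags, or the same flag across time.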

I am also curious what our sbws and TorFlow authorities thought of this
capacity change in terms of load balancing. For this we would need to
look at 8 hour snapshots of CDF-Relay-Stream-Capacity, broken out by relay
flag, using the consensus, as well as the input votes from directory
authorities connected to sbws and TorFlow:
https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics#BalancingMetrics

In both cases, it will be interesting to investigate the tails of these
CDFs to see if they contain similar individual relays (such as those
that are unbalanced due to liberal Exit policies, long term Guard
status, multi-daemon deployments, etc).
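
One simple way to check for that overlap, assuming each metric has
already been reduced to a single value per relay, is to take the worst
few percent of relays under each metric and intersect the sets. The
relay names, values, and the tail_relays helper below are all
hypothetical.

```python
def tail_relays(metric_by_relay, fraction=0.05, worst_is_high=True):
    """Return the set of relays in the worst `fraction` of the
    distribution (at least one relay)."""
    ranked = sorted(metric_by_relay, key=metric_by_relay.get,
                    reverse=worst_is_high)
    count = max(1, int(len(ranked) * fraction))
    return set(ranked[:count])


# Made-up per-relay values: high latency is bad, low capacity is bad.
ttfb = {"A": 0.4, "B": 0.5, "C": 3.0, "D": 0.6}
capacity = {"A": 9.0, "B": 8.0, "C": 0.5, "D": 7.0}

slow = tail_relays(ttfb, fraction=0.25)
starved = tail_relays(capacity, fraction=0.25, worst_is_high=False)
overlap = slow & starved
```

Relays that land in both tails would be natural first candidates for a
closer look.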

The metrics team intends to work on scripts to make this kind of
analysis easier starting in January. Since the data is retained, there
is no rush on this, but it is necessary before we could consider
deploying anything like Rob's experiment long-term.

Dennis also reminded us of similar outliers discovered in latency
analysis from over the summer:
https://lists.torproject.org/pipermail/tor-scaling/2019-July/000063.html
https://www.jottacloud.com/s/153e8e5540a75da4e98b86206cb2f761dc2

This led to a discussion of whether we should be handling issues like the
latency banding problem and load (im)balancing issues by doing deep root
cause analysis to find specific bugs, or if we should use systemic
corrections to deal with these kinds of issues.

I favor the latter in most cases, especially where user impact is severe
and systemic corrections can also help identify which relays require it
(which is the case for both load balancing and latency issues), so that
any later root cause analysis can be performed more quickly.

-- 
Mike Perry
