[tor-dev] Bandwidth scanner: request for feedback

Mon Nov 19 12:36:28 UTC 2018

Hi,

We have deployed sbws on one bandwidth authority (longclaw).

Here's a request for additional feedback, and a progress update:

Request for Feedback: Relay Bandwidth Self-Tests

Torflow and sbws use relays' self-reported observed bandwidths for
load balancing. But a relay's observed bandwidth can be very low
because the relay is new, or because of random path selection.

In torflow, relays can get stuck in a low-bandwidth partition. sbws
doesn't have partitions. But in both systems, low bandwidths can
cause inaccurate or unstable load balancing.

Since torflow and sbws need accurate self-reported relay bandwidths,
some component of the Tor network needs to send enough bandwidth
through every relay.

Here are our current choices:

Tor relays can do a regular bandwidth self-test, so that their
first descriptor has an accurate bandwidth (up to some minimum). But
the current self-test is too small, and buggy.

sbws already sends bandwidth to all relays to measure them. sbws gets
accurate bandwidths for most relays within 2 weeks, but the fastest
relays can take a month to ramp up. (sbws starts measuring at the
median relay bandwidth, and can double every 5 days.)
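
As a rough illustration of the ramp-up arithmetic above, here is a
minimal sketch. It assumes the best case, where a relay's bandwidth
doubles every 5 days without interruption, starting from the median.
The function names are hypothetical and are not part of sbws.

```python
import math

# From the text: sbws "can double every 5 days". Treating this as
# continuous, uninterrupted doubling is an assumption (best case).
DOUBLING_PERIOD_DAYS = 5


def max_growth(days: float, doubling_days: float = DOUBLING_PERIOD_DAYS) -> float:
    """Largest multiple of the median bandwidth reachable after `days`."""
    return 2 ** (days / doubling_days)


def days_to_reach(target_ratio: float, doubling_days: float = DOUBLING_PERIOD_DAYS) -> float:
    """Best-case days to grow from the median to target_ratio * median."""
    if target_ratio <= 1:
        return 0.0  # relays at or below the median need no ramp-up
    return doubling_days * math.log2(target_ratio)


if __name__ == "__main__":
    print(f"after 14 days: up to {max_growth(14):.1f}x the median")
    print(f"after 30 days: up to {max_growth(30):.0f}x the median")
```

Under these assumptions, a relay can reach about 7 times the median
in 2 weeks, and 64 times the median in 30 days.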

Should we improve relay bandwidth self-tests? (#22453)
Or should we rely on sbws to create the bandwidths it needs?
What about test networks?

Should we make bandwidths grow faster in sbws?
Or is a ramp-up period of 2-5 weeks fast enough?

(We won't modify and re-deploy torflow.)

Progress Update

> On 30 Aug 2018, at 07:11, Mike Perry <mikeperry at torproject.org> wrote:
> 
> teor:
>> 
>> What happens when sbws doesn't match torflow?
>> 
>> https://trac.torproject.org/projects/tor/ticket/27339
>> 
>> We suggest this rule:
>> 
>> If an sbws deployment is within X% of an existing bandwidth
>> authority, sbws is ok. (The total consensus weights of the
>> existing bandwidth authorities are within 25% - 50% of each
>> other, see #25459.)

We have successfully used this rule to discover and fix some bugs in
sbws.
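
A minimal sketch of one way to apply this rule, assuming we compare
total consensus weights and accept sbws if it is within X% of at
least one existing bandwidth authority. The function name and the
example totals are hypothetical.

```python
def within_percent(sbws_total: float, reference_totals: list[float], percent: float) -> bool:
    """Return True if sbws_total is within `percent` of any reference total.

    The difference is measured relative to each existing authority's
    total (an assumption: the rule does not say which side is the base).
    """
    for ref in reference_totals:
        if ref <= 0:
            continue  # skip authorities with no usable total
        if abs(sbws_total - ref) <= ref * percent / 100:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical totals, for illustration only.
    print(within_percent(90, [100, 200], 25))
    print(within_percent(50, [100, 200], 25))
```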

> I would like an additional criteria for when we finally replace torflow
> with sbws.
> 
> Ideally, I would like us to perform A/B experiments to ensure that our
> performance metrics do not degrade in terms of average *or* quartile
> range/performance variance. (Ie: alternate torflow results for a week vs
> sbws for a week, and repeat for a few weeks). I realize this might be
> complicated for dirauth operators, though. Can we make it easier
> somehow, so that it is easy to switch which result files they are voting
> with?

We do not have the capacity to A/B test sbws and torflow.
(As far as I understand, we don't have enough people, and we don't have
enough servers.)

> If we can't do this, at minimum, we should definitely watch the change
> in our average and quartile variance performance metrics when we first
> switch to sbws.

We deployed sbws on 1/6 bandwidth authorities, and the performance of
the network has been stable:
https://metrics.torproject.org/torperf.html?start=2018-01-21&end=2018-11-19&source=all&server=public&filesize=50kb

(The drop in performance at the start of the year was due to extra
network load.)

> Additionally, if we ever change how sbws behaves to be different than
> torflow, I would like sbws to have a well-defined load balancing
> equilibrium goal, and I would like us to not change this load balancing
> equilibrium goal unless we perform A/B testing and compare the average
> and variance of our performance metrics.
> 
> I'll explain what I mean by "load balancing equilibrium goal" below,
> when I try to explain the PID mechanism again.

sbws has adopted Torflow's load-balancing equilibrium goal.

Our priority is transitioning away from Torflow successfully.

We've deferred changes to the load-balancing goal until a later sbws
release. We may never make this change.

>> How long should sbws keep relay bandwidths?
>> 
>> https://trac.torproject.org/projects/tor/ticket/27338
>> 
>> Torflow uses the latest self-reported relay observed bandwidth
>> and bandwidth rate.
>> 
>> Torflow uses a complex feedback loop for measured bandwidths.
>> We think sbws can use a simple average or exponentially
>> decaying weighted average.
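
To make the two candidates concrete, here is a minimal sketch of a
simple average and an exponentially decaying weighted average over
a relay's measurements. The decay factor of 0.5 per measurement is
an assumption for illustration, not a value used by sbws.

```python
def simple_average(measurements: list[float]) -> float:
    """Plain mean of all stored measurements."""
    if not measurements:
        raise ValueError("no measurements")
    return sum(measurements) / len(measurements)


def decaying_average(measurements: list[float], decay: float = 0.5) -> float:
    """Weighted mean where each older measurement counts `decay` times less.

    `measurements` is ordered oldest first. The newest measurement has
    weight 1, the one before it `decay`, then `decay ** 2`, and so on.
    """
    if not measurements:
        raise ValueError("no measurements")
    newest_first = list(reversed(measurements))
    weights = [decay ** age for age in range(len(newest_first))]
    weighted_sum = sum(w * m for w, m in zip(weights, newest_first))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    history = [100.0, 200.0, 300.0]  # hypothetical values, oldest first
    print(simple_average(history))
    print(decaying_average(history))
```

The decaying average follows recent measurements more closely, while
the simple average treats every stored measurement equally.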
> 
> As I said in
> https://lists.torproject.org/pipermail/tor-dev/2017-December/012714.html,
> this feedback loop is disabled. I know you don't believe that the
> bandwidth auth spec is accurate, but I'm telling you it is.

Improving bandwidth measurement has been one of the most difficult
things I have done with Tor.

You're right: I don't know if the Torflow spec is accurate, because I
often struggle to find the information I need in the spec.

That's not anyone's fault: it's a difficult and complex topic. But it
does mean that I need your help to answer some questions about Torflow.

> The point of the PID control stuff was to formalize the type of load
> balancing equilibrium goal that the bandwidth auths are using, and to
> experiment with convergence on a specific target load balancing
> equilibrium point (where that target equilibrium point is "all relays
> have the same spare capacity for one additional client stream"). The
> problem was that when you only use this criteria, faster relays run out
> of CPU, memory, or sockets before this criteria was satisfied for them.
> Hence all of the circuit failure reason statistics in the code base (to
> try to back off on PID control if we hit a different limiting factor
> other than bandwidth).
> 
> ...
> 
> I'm glad that we are exploring load balancing again, and with a modern,
> simpler, and well-tested code base. That's all good. But as you make
> choices about how to load balance, please have a specific goal as to
> what target load balancing equilibrium point you're actually going for.

sbws has adopted Torflow's goals.

>> How should we scale sbws consensus weights?
>> 
>> https://trac.torproject.org/projects/tor/ticket/27340
>> 
>> If sbws' total consensus weight is different to torflow's total
>> consensus weight, how should we scale sbws?
>> 
>> (The weights might differ because the measurement method is
>> different, or because scanners and servers are in different
>> locations.)
>> 
>> In the bandwidth file spec, we suggest linear scaling.
> 
> This seems reasonable.

Unfortunately, linear scaling did not work.

sbws now uses Torflow's scaling method, with relay observed bandwidths.
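
For comparison, here is a minimal sketch of both approaches. The
first function is plain linear scaling: every weight is multiplied
by one constant so the totals match. The second is a simplified
reading of ratio-based scaling with observed bandwidths: each
relay's observed bandwidth is multiplied by the ratio of its
measured bandwidth to the network mean. It is an assumption for
illustration and leaves out the filtering and limits that the real
method applies. All names and numbers are hypothetical.

```python
def linear_scale(weights: dict[str, float], target_total: float) -> dict[str, float]:
    """Multiply every weight by one factor so the sum equals target_total."""
    current_total = sum(weights.values())
    if current_total <= 0:
        raise ValueError("total weight must be positive")
    factor = target_total / current_total
    return {relay: w * factor for relay, w in weights.items()}


def ratio_scale(observed: dict[str, float], measured: dict[str, float]) -> dict[str, float]:
    """Simplified sketch: observed bandwidth times (measured / mean measured)."""
    if not measured:
        raise ValueError("no measurements")
    mean_measured = sum(measured.values()) / len(measured)
    if mean_measured <= 0:
        raise ValueError("mean measured bandwidth must be positive")
    return {
        relay: observed[relay] * (measured[relay] / mean_measured)
        for relay in measured
    }


if __name__ == "__main__":
    print(linear_scale({"A": 100, "B": 300}, 800))
    print(ratio_scale({"A": 1000, "B": 1000}, {"A": 50, "B": 150}))
```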

> ...
> 
> I believe quite strongly that even if the Tor network gets faster on
> average, if this comes at the cost of increased performance variance,
> user experience and perceived speed of Tor will be much worse. There's
> nothing more annoying than a system that is *usually* fast enough to do
> what you need it to do, but fails to be fast enough for that activity at
> unpredictable times.

I agree. And I'm usually using Tor in high-latency locations, so I see
this variance every day.

>> How should we round sbws consensus weights?
>> 
>> https://trac.torproject.org/projects/tor/ticket/27337
>> 
>> Torflow currently rounds to 3 significant figures (which is a maximum
>> of 0.5%). But I suggest 2 significant figures for sbws (or max 5%),
>> because:
>> - tor has a daily usage cycle that varies by 10% - 20%
>> - existing bandwidth authorities vary by 25% - 50%
>> 
>> Proposal 276 contains a slightly more complicated rounding algorithm,
>> which we may want to implement in sbws or in tor:
>> 
>> https://gitweb.torproject.org/torspec.git/tree/proposals/276-lower-bw-granularity.txt
> 
> If we can measure relays frequently enough such that we can accurately
> report the effects of Tor's daily usage cycle and adjust our weights
> accordingly, then I think that retaining the ability to represent this
> variance is worth the overhead.
> 
> Again, this comes back to my belief that performance variance is
> actually the major performance problem facing Tor right now.
> 
> On the other hand, if we cannot measure accurately or often enough for
> this to matter, then it doesn't matter.

I don't believe sbws can measure relays fast enough for it to matter.

sbws was rounding to the nearest 1000 kilobytes on 1 of the 6
authorities, with no discernible effect on network performance. We've
fixed this bug, and sbws will now round to 2 significant figures. (We
haven't implemented the extra last-digit rounding in prop276.)
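
A minimal sketch of rounding to significant figures, with a check of
the error bounds quoted above (at most 5% for 2 figures, and at most
0.5% for 3 figures). This is an illustration using integer
arithmetic and round-half-up, not the sbws implementation.

```python
def round_sig(value: int, figures: int = 2) -> int:
    """Round a non-negative integer to `figures` significant figures."""
    if value <= 0:
        return 0
    digits = len(str(value))
    if digits <= figures:
        return value  # already short enough, nothing to round
    factor = 10 ** (digits - figures)
    # Integer round-half-up to the nearest multiple of `factor`.
    return ((value + factor // 2) // factor) * factor


def max_relative_error(figures: int, limit: int = 100_000) -> float:
    """Largest relative rounding error over the integers 1..limit."""
    return max(abs(round_sig(v, figures) - v) / v for v in range(1, limit + 1))


if __name__ == "__main__":
    print(round_sig(12345))      # 2 significant figures
    print(round_sig(12345, 3))   # 3 significant figures
    print(f"{max_relative_error(2):.4f}")
    print(f"{max_relative_error(3):.4f}")
```

The worst case occurs just above a power of ten, for example when
1050 is rounded to 1100.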

If sbws can measure fast enough in future, we can modify it to report more
accurate bandwidths.

> But a successor to sbws might, if we can manage to build one sooner than
> a decade from now, so it would be wise not to bake this sig fig limit
> into our actual consensus format.

Thanks, we won't modify tor to round bandwidths: that responsibility
belongs in the bandwidth measurement code.

But tor's consensus diffs and compression benefit from rounded relay
bandwidths, so any performance gain needs to be weighed against an
increase in consensus download sizes.

>> Does sbws need a maximum consensus weight fraction?
>> 
>> https://trac.torproject.org/projects/tor/ticket/27336
>> 
>> Torflow uses 5%, but I suggest 1%, because the largest relay right
>> now is only 0.5%.
> 
> Sounds reasonable.
> 
> If we ever get working multi-core crypto+networking, this number will
> change, though.

We went with 5%, to match Torflow.
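
A minimal sketch of a maximum consensus weight fraction, assuming
the cap is computed once from the uncapped total and applied to each
relay. The function name and the example weights are hypothetical,
and the real code may compute the cap differently.

```python
def cap_weights(weights: dict[str, float], max_percent: float = 5) -> dict[str, float]:
    """Limit each relay to max_percent of the total (uncapped) weight."""
    total = sum(weights.values())
    cap = total * max_percent / 100
    return {relay: min(w, cap) for relay, w in weights.items()}


if __name__ == "__main__":
    # One large relay and 99 small ones (hypothetical values).
    weights = {"big": 1000.0}
    weights.update({f"r{i}": 100.0 for i in range(99)})
    capped = cap_weights(weights)
    print(capped["big"], capped["r0"])
```

Because the cap is based on the uncapped total, a capped relay can
end up slightly above 5% of the new, smaller total.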

T
