[tor-dev] Proposal 328: Make Relays Report When They Are Overloaded

David Goulet dgoulet at torproject.org
Fri Dec 11 14:04:18 UTC 2020

On 09 Dec (11:36:09), Jim Newsome wrote:
> On 12/7/20 14:06, David Goulet wrote:
> > Greetings,
> >
> > Attached is a proposal from Mike Perry and me. Merge request is here:
> >
> > https://gitlab.torproject.org/tpo/core/torspec/-/merge_requests/22
> Disclaimer - As someone not very familiar with how tor load balancing
> works today, I might not be the target audience for this proposal :)
> Maybe it's putting the cart before the horse, but it might be helpful to
> have a more concrete proposal for how this data will be used, which in
> turn will help evaluate whether this is the right data to collect.
> e.g. naively I might assume the idea is to have some kind of exponential
> backoff for overloaded relays; but since the proposal is for the
> overload events to be recorded at hour-granularity, would that result in
> a relay getting overloaded at the top of every hour, and then
> under-utilized for the rest of the hour?

Right, so there are currently ideas circulating on how to use that data.

The likely short-term proposal is for sbws (the bw scanner) to use it as a
simple signal to back off on the amount of bw given, as you stated.

Thus your question hits the nail on the head about "why do we have this
proposal without a concrete proposal on how to use it" :).

The answer I can give you is that we've thought about how a relay can tell the
world, in a safe way, that it is suffocating. There are a few places in tor
where we can actually notice (at the moment) performance problems.

And so we took them all (more might come over time), and mashed them into a
single line "overload reached". And we did that before anything else because
for the network to migrate to support that feature, we are talking a good 2-4
years minimum once the feature is stable. Thus we have to get this out soon if
we hope for it to be useful in the foreseeable future.
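
As a rough illustration of "mashing" several signals into one, here is a
minimal Python sketch. The signal names are hypothetical placeholders, not
tor's actual identifiers, and the hour truncation only mirrors the
hour-granularity mentioned in the question above. It is not the format or the
logic from the proposal itself.

```python
from datetime import datetime, timezone

def overload_reached(signals, now):
    """Collapse several overload signals into one hour-granular timestamp.

    `signals` maps a (hypothetical) signal name to True if it fired.
    Returns `now` truncated to the hour if any signal fired, else None.
    """
    if any(signals.values()):
        # Only the hour is kept, so the exact moment is not exposed.
        return now.replace(minute=0, second=0, microsecond=0)
    return None

# Hypothetical signal names, for illustration only.
example = {"oom_invoked": False, "onionskins_dropped": True}
stamp = overload_reached(example, datetime(2020, 12, 11, 14, 4, 18, tzinfo=timezone.utc))
print(stamp)
```

The point of the single flag is that a consumer such as a scanner does not
need to know which of the underlying conditions fired.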

Onto your next question about the hour problem. So yes, you are correct that
in the timeframe between a relay informing the world that it is not overloaded
anymore and the world noticing, it might get under-utilized, but it might also
get "just utilized enough".

All in all, we are stuck with a network that "morphs" every hour (new
consensus) but actually, bandwidth scanners take much longer to scan the
entire network (on the order of days), thus it is actually much more than an
hour of being under-utilized :S.

So there will always be that gap where we back off from a relay and then
we might have backed off too much until the scanner notices it and then gives
you a bit more. But over time, as the backoff happens and the bw scanner makes
corrections, they should reach an equilibrium where the scanner finds the value
that is just enough for you to not advertise overload anymore, or in other
words, finds the sweet spot.
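
To make the convergence idea concrete, here is a toy Python simulation. It is
only a sketch under stated assumptions: a relay with a fixed `capacity`
reports overload whenever its assigned weight exceeds that capacity, and a
hypothetical scanner halves its correction step each time the overload signal
flips. This is not how sbws works; the function name and parameters are made
up for illustration.

```python
def find_sweet_spot(capacity, start_weight, rounds, step=0.5):
    """Toy back-off loop: one iteration stands for one full scanner pass.

    Returns a list of (weight, overloaded) pairs, one per round.
    """
    weight = start_weight
    last_overloaded = None
    history = []
    for _ in range(rounds):
        # The relay advertises overload when it is given more than it can carry.
        overloaded = weight > capacity
        history.append((round(weight, 2), overloaded))
        if last_overloaded is not None and overloaded != last_overloaded:
            # The signal flipped, so we overshot: correct more gently from now on.
            step /= 2
        # Back off when overloaded, otherwise give the relay a bit more.
        weight = weight * (1 - step) if overloaded else weight * (1 + step)
        last_overloaded = overloaded
    return history

for weight, overloaded in find_sweet_spot(capacity=100, start_weight=200, rounds=12):
    print(weight, "overloaded" if overloaded else "ok")
```

In this model the weight bounces around the capacity with shrinking swings,
which is the "bumpy road, then equilibrium" behaviour. If the capacity itself
moves between rounds (an unstable relay), the loop never settles.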

That is likely to require time, and the relay needs to be very stable, as in
99% uptime and not too many CPU/network fluctuations.

But also, as we back off from overloaded relays, we will send traffic to
potentially under-utilized relays. So yes, we expect it will be a bumpy
road at first, but after some days/weeks the network should stabilize and we
should actually see very few "overload-reached" after that point (except for
operators running 1000 other things on the relay machine eating the resources
randomly :).

This also highlights the massive importance of stable relays on the
network so its load balancing can adjust and converge to an equilibrium
without having to re-adjust because 1000 relays on pi4 went down for the night.
Hope this answers your question!

