[tor-dev] Proposal 328: Make Relays Report When They Are Overloaded

Jim Newsome jnewsome at torproject.org
Fri Dec 11 17:25:52 UTC 2020


On 12/11/20 08:04, David Goulet wrote:

> we are talking a good 2-4
> years minimum once the feature is stable, thus we have to get this out soon
> if we hope for it to be useful in the foreseeable future.

Right - the slow feedback cycle between deploying new logging and
trying to use it is all the more reason to plan ahead to try to ensure
the data will actually be suitable for the intended use :). Granted, we
can presumably at least *start* prototyping usage of the data sooner
than 2-4 years, but it'll probably still be some months before any
useful data starts arriving, right?

> Onto your next question about the hour problem. So yes, you are correct that
> in the timeframe between informing the world that I'm not overloaded anymore
> and the world noticing, you might get under-utilized, but you might also get
> "just utilized enough".
>
> All in all, we are stuck with a network that "morphs" every hour (new
> consensus), but the bandwidth scanners actually take much longer to scan the
> entire network (on the order of days), thus it is actually much more than an
> hour of being under-utilized :S.
>
> So there will always be that gap where we will back off from a relay and
> might have backed off too much, until the scanner notices it and then gives
> you a bit more. But over time, as the backoff happens and the bw scanner
> makes corrections, they should reach an equilibrium where the scanner finds
> the value that is just enough for you to not advertise overload anymore, or
> in other words, it finds the sweet spot.
>
> That is likely to require time, and the relay to be very stable, as in 99%
> uptime and without too many CPU/network fluctuations.
>
> But also, as we back off from overloaded relays, we will send traffic to
> potentially under-utilized relays, and so we hope that, yes, it will be a
> bumpy road at first, but after some days/weeks the network should stabilize
> and we should actually see very few "overload-reached" events after that
> point (except for operators running 1000 other things on the relay machine,
> eating resources at random :).

Thanks for the explanation! IIUC the new consensus computed every hour
includes weights based on the latest data from the bandwidth scanners,
and an individual relay is only scanned once every x days?
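
If so, the lag is easy to picture with a toy loop like the one below:
a relay's weight only gets corrected when the scanner next visits it,
so an overload (or an over-correction) persists for a whole scan
interval. All the numbers here are invented for illustration; this is
not the sbws or dirauth algorithm, just a sketch of the dynamic you
describe.

    # Toy model: consensus updates hourly, but this relay is only
    # re-measured every SCAN_INTERVAL hours, so corrections lag the
    # overload signal. All constants are made up for illustration.
    CAPACITY = 100        # relay's true sustainable load (arbitrary units)
    SCAN_INTERVAL = 72    # hours between scans of this relay
    BACKOFF = 0.75        # cut applied when the relay reported overload
    GROWTH = 1.05         # slow increase when it did not

    weight = 160          # initial consensus weight, deliberately too high
    for hour in range(24 * 30):
        overloaded = weight > CAPACITY    # assume traffic tracks weight
        if hour % SCAN_INTERVAL == 0:     # correction only lands at scan time
            weight *= BACKOFF if overloaded else GROWTH
            print(f"hour={hour:4d} weight={weight:6.1f} overloaded={overloaded}")

Running it, the weight oscillates for a few scan cycles and then hovers
around CAPACITY, which matches the "sweet spot" behavior you describe.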

In the proposal, maybe it'd be enough to briefly explain the choices of
parameters and any relevant tradeoffs - one hour for granularity, 72
hours for history, (any others?). It might also be helpful to have a
strawman example of how the data could be used in the congestion control
algorithm, with some discussion like the above ^, though I could also
see that potentially getting too far from the core of the proposal.
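
For concreteness, here's one such strawman (explicitly *not* from the
proposal): a path-selection step that discounts relays whose descriptor
carried an overload line within the 72-hour window. The field names and
the 0.5 penalty are invented for illustration.

    import random
    from dataclasses import dataclass

    @dataclass
    class Relay:
        nickname: str
        consensus_weight: int
        overloaded_recently: bool  # hypothetical flag parsed from "overload-reached"

    def effective_weight(relay, penalty=0.5):
        """Down-weight relays that reported overload in the last 72h."""
        return relay.consensus_weight * (penalty if relay.overloaded_recently else 1.0)

    def pick_relay(relays):
        weights = [effective_weight(r) for r in relays]
        return random.choices(relays, weights=weights, k=1)[0]

    relays = [Relay("fast", 9000, overloaded_recently=True),
              Relay("steady", 5000, overloaded_recently=False)]
    print(pick_relay(relays).nickname)

Even a rough sketch like that would make it easier to check whether the
hour/72-hour parameters actually suit the consumer of the data.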

Btw, maybe it's worth explicitly explaining how the data *won't* be
useful to attackers? I'd assumed (possibly incorrectly) that the
history is kept at hour granularity rather than anything finer to
prevent an attacker from correlating events across relays and, from
there, perhaps inferring something about individual circuit paths. If
that sort of attack is worth worrying about, should relays also
suppress reporting events for the current partial hour, to avoid an
attacker being able to probe the metrics port to find out whether an
overload just happened?
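
If suppression like that is wanted, the logic could be as simple as
bucketing events by hour and only ever exposing fully completed hours.
The bucketing scheme below is my assumption, not something from the
proposal:

    import time

    HOUR = 3600

    def reportable_buckets(event_times, now=None):
        """Return overload hours (epoch-hour buckets) within the last 72
        completed hours, excluding the still-open current hour."""
        now = time.time() if now is None else now
        current = int(now) // HOUR          # bucket still being filled
        cutoff = current - 72               # 72-hour history window
        buckets = {int(t) // HOUR for t in event_times}
        return sorted(b for b in buckets if cutoff <= b < current)

That way a probe of the metrics port can only ever learn about events
from completed hours, never the hour in progress.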

> Hope this answers your question!

Very helpful, thanks!



